linked lists in Tcl, part II.

Adam Sah (asah@ginsberg.CS.Berkeley.EDU)
17 Oct 1994 13:18:05 GMT

I have received independent pieces of email from Tom Tromey and Matt Dillon
pointing out an error in my analysis of Tcl's ability to support linked
lists. In forming my response, I concentrated on the plethora of posts
attempting to claim that Tcl lists are the same as linked lists; in
fact, I failed to consider Tcl's hash tables properly, on which both
authors use their counterexample-based arguments. However,
both proposed designs fail on several accounts: first, they don't take into
accounts the memory mamagement issues involved (ie. leaking memory).
Second, they don't consider the scoping issues (ie. given a "reference",
can I really access my array element?). Lastly, they neglect the detail
that sublist creation in a hash table implementation is no longer O(1)
in time complexity. Although I don't say it explicitly, the sorts of
semantic problems are non-issues in languages like Scheme, which have
real data references and real garbage collection.

Here is an excerpt of Dillon's response, which was posted to comp.lang.scheme:
> You implement the linked list using arrays. Arrays in TCL are not
> only associative, but also use hash lookups. Therefore, they are
> very close to O(1).
>
> set LIST(root,next) root
> set LIST(root,prev) root
> set LIST(root,data) whatever
>
> The symbolic name you choose for the elements does not matter... for
> example, simply use an integer ID. All you need to do is guarentee
> uniqueness.
>
> Using this methodology, you can insert elements into the 'linked list'
> anywhere in O(1) time (ok, ok, O(N/4000) or whatever the array access
> time is, but that is a minor issue).

and here is an excerpt from Tromey's:
> Suppose a "linked list" is implemented using a Tcl array. The keys of
> the array are arbitrary and are chosen by the linked list code. Each
> value in the array is a list of 2 elements -- the first being another
> index, and the second being the value the user supplies.
>
> Further suppose that all references (external to the package that
> implements the linked list) to the linked list are lists of 2 elements
> -- the name of the array and the name of the head index.
>
> * There is no size limit on the collection.
> * Values can be of any "data type", which in Tcl just means they can
> be any value.
> * Order remains the same under insertion & deletion.
> * Lookup does require O(n).
> * Given a valid index into the list, you can construct a sublist by
> consing the name & the index.
> * Sublists share data with the original list.

First, these proposals don't account for memory management. For example,
let's imagine the operation where we destructively modify the linked list
to "skip" cartin elements in the middle of a list. Assuming we don't
want to write a specialty function to handle this particular case, we
have to handcode the memory management functions to free the "dropped"
elements in the middle, lest they become unreachable.

The generalization of the problem is that Tcl doesn't offer garbage
collection, and so it becomes diffcult to maintain the invariants that
allow Tcl's automatic memory manager to find unreachable data, as when
you overwrite a variable with a new value. For example, this is the
reason Tcl doesn't have "pointers" or other references; if it did, you'd
quickly run into cases when you couldn't trivially reclaim unreachable data.
Requiring free() call would obviously be in conflict with its "very high
level" nature.

To language theorists, this is a subset of the escape problem, where an
entity "escapes" its automatic scope, where reclamation would be trivial.
Right now, Tcl is careful to avoid this problem; when you allow
references, more sophisticated garbage collection schemes are required.
For example, consider the following dilemmas posed by Tromey's proposal:
- how can an element appear in two linked lists at once? (answer:
it's not enough that external references to l-l elements keep the
original-list-name; all elements in all lists now need them) What
happens if the element is deleted out of the original list, but not
out of the copies?
- how do I return a linked list as the return value from a proc?
how do I return a sublist as the return value from a proc?
- if a locally-scoped linked list contains a reference to an element in
another linked list, and the first list goes out of scope, what happens
to the pointed-to-element? (assume both cases: (a) this is the last
reference to it, and (b) this is _not_ the last reference to it)

And sure, you can play games about what constitutes a valid linked list
element, but I don't consider it reasonable to turn all possible
sublists into first-class lists- what would you name them? Meanwhile,
you haven't solved your garbage collection problem. (and if we let you
have circular lists, or even references to lists embedded in list
elements, then reference counting is out, as well)

Second, in Tromey's proposal, we use the STRING names of linked lists as
references to them. But what about a function whose return value is an
element of a linked list scoped locally to that function? The Tcl
memory manager will try to reclaim your array variable, making your
pointer invalid- just like if you returned the address of a local
variable in C. Again, a real garbage collector would help solve this,
although I'd be shocked if someone invented a collector that could treat
string names as references to objects. Some more rhetorical questions:
- what happens if we try to nest a linked list inside of another? How
do we name it? How do we reference it?
- if I pass an {array, index} pair to a proc, and it doesn't upvar to
bring the array into its scope, but instead passes it onto a lower
scope, how is the ultimate receiver supposed to know how many levels
of "upvar" it needs to do to find the array variable?

Third, we should keep in mind that sublist creation in both proposals is no
longer O(1), but O(n). For many applications, this doesn't matter, but
remember that _any_ expression resulting in a list creates a new
sublist.

I think it's instuctive to step back from this conversation a bit.
Language semantics are _hard_- many very brilliant people have worked for
the past 65 years on formal language theory, and I'm firmly convinced
that this hasn't been a masturbatory or irrelevant pursuit. I beg the
pundits to consider:

- why do virtually all languages proposed by language researchers contain
garbage collectors, especially if they've historically been so expensive?
(they're much, much cheaper now- see studies like Zorn's [ZORN92])
[hint: see the next question]

- what's the big deal about closures? Why do people love them?
[hint: Tcl users- try passing the "after" command in Tk a nontrivial
block of statements, and assume that you want to reference a variable
local to the scope of the caller of "after". Assume that multiple "after"s
are outstanding (ie. the calling proc can be called before its "after"
code is serviced, so static variables don't work. Now try it with
closures- amazing how much easier it is...).

- why does Lisp/Scheme tag data with type information? Why doesn't ML?
What are the tradeoffs? (hint: here's some code I tried to write in
response to the above replies... this function is supposed to create
a new hash table and set it up):

# no reasonable way to check if "l" is already something other than an array
# in the caller's scope.
proc linked-list {listName} {
upvar 1 $listName l ;kludgy hack for pass-by-reference
set $l(TYPE) "linked-list"
set $l(NEXT-ITEM-ID) 0
set $l(HEAD) ""
set $l(TAIL) ""
set $l(SIZE) 0
}

- why are Scheme lists implemented as cons-pairs? Aside from the problem
of circular lists, why don't they store their length with the head of
the list?

At the high level, it is naive and dangerous to try to oversimply the
problems here. The fact that the language theorists haven't packaged
their work with pleasing syntax is not a valid argument to reinvent the
wheel as a five-sided "square". Think OSs: the fact that Unix command
names are cryptic isn't a good excuse to assign 640K memory limits.
You should fix the syntax if that's what's broken.

Adam Sah
UC Berkeley Dept. of Computer Science
Berkeley, CA 94720-1776
asah@cs.berkeley.edu

References
--------------------

[ZORN92] Benjamin Zorn. "The Measured Cost of Conservative Garbage
Collection." Univ. of Colorado at Boulder Technical Report #573-92
ftp://cs.colorado.edu:/pub/techreports/zorn/CU-CS-573-92.ps.Z