Comment by matheusmoreira

1 year ago

> Why are arrays tables

I might be able to provide at least some insight into that.

One of the things I found while implementing my language was that the ordering of elements in a hash table is essentially random. Literally random if you mix in some randomness into the hash function's initial state in order to harden it against collision attacks.

Many languages guarantee insertion ordering. It's a useful property to have, makes testing easy and deterministic and it's intuitive for users. A simple way to implement that is to put a perfectly normal array of values into the hash table, and then have the buckets which the keys hash into contain indexes into that array. That way the indexes of values remain stable no matter what the keys hash to.

So a sufficiently advanced hash table will have an array of values inside it. I suppose some language designers took one look at that situation and decided to just merge both types already by using hash tables for everything. For example, PHP also did that. Associative arrays, they call it.

CPython's hash table (dict) evolved to have an array of values indexed by a hash table of indices for performance reasons, but later decided to guarantee the insertion ordering that this permitted for iteration.

"Associative array" is just the term used for hash tables (dicts) in Awk, bash, Tcl, and, most relevantly for PHP, Perl4. It doesn't specifically mean that the language lacks a separate container type for small-integer-indexed arrays. It just means that that particular container type can be indexed by something other than small integers, usually strings.

  • > CPython's hash table (dict) evolved to have an array of values indexed by a hash table of indices for performance reasons

    In case anyone is curious about it:

    https://mail.python.org/pipermail/python-dev/2012-December/1...

    > In addition to space savings, the new memory layout makes iteration faster.

    > Currently, keys(), values, anditems() loop over the sparse table, skipping-over free slots in the hash table.

    > Now, keys/values/items can loop directly over the dense table, using fewer memory accesses.

    The "dense table" is the array of values. The "sparse table" is the actual hash table, only the values index into the values array.

    I thought the insertion ordering was implemented at around the same time as a result of it. The data structure ensures it pretty much for free, seemed like the obvious next step. It appears I was mistaken about that.

    • They did implement insertion ordering at that time, but they didn't want to commit to adding it to the official language semantics until having enough experience to know whether they wanted to back out the change to the implementation which provided that guarantee for free.

  • Yes, this name is basically a reflection of the fact that a hash table is an array, indexes of which get calculated by `hashof(key) % arr.length`.

    It doesn’t mean HTs are the arrays which should be used to store seqint-indexed values.

    • No, that's an implementation detail which isn't actually true of the associative arrays in all of these languages. "Associative array" is the name for the interface, not the implementation. In theory it could be a hash trie or something, although I'm not aware of any implementations of any of these languages that uses anything but a hash table (which is, yes, fundamentally dependent on using an array).

You described how every language without hashtable-as-an-object works. The only difference is when you address fields.

But I fail to see the logic behind seeing this and merging array/table ideas. Feels like one of these irrational decisions you make in a dream.

And Lua doesn’t use hash table to store arrays. Its tables have two different parts (hash, indexed array) with foggy distribution logic which reflects into “# vs ipairs” issue. This merge is purely synthetic and doesn’t project any underlying implementation idea.

  • Merging the concepts of arrays and maps was very popular in the 1980s and 1990s. I can count at least Awk, Tcl, Lua, JavaScript and PHP as languages that went with this approach.

    Looking from today's perspective, I also fail to see the logic behind it: separating arrays and dictionaries does not make your implementation more complicated than having a single backing data structure for both and taking care of all the special cases that arise from such a monstrosity.

    The best explanation I can think of is that this was just a fad nobody questioned too much. Back in the 1980s and 1990s, there was a strong distinction between "scripting languages" and general-purpose dynamic languages (like Lisp and Smalltalk). Scripting languages were supposed to be used to write small scripts to automate stuff. Type-safety was not something that any mainstream language considered back then (it was a thing in niche languages like Ada, SML and Eiffel), and speed was not a concern for scripting languages.

    A common philosophy (though I never seen it clearly stated) for scripting languages seems to have been to "be liberal and try to work with whatever the programmer does".

    So you shouldn't require to declare variables in advance (that's a lot of hassle!) or throw an error if a function is called with the wrong number of arguments.

    And what if the programmer tries to add an integer and a number written as string? Just coerce them. Or better yet, make everything a string like Tcl does. Or at the very least, don't be a spoilsport and have separate types for integers and floating point numbers!

    Ok, so what if the programmer tries to use a string index? We cannot coerce non-numeric strings! Well, why don't we just unify arrays and dictionaries like we've just unified all our number types? Now you don't have to worry about invalid indexes since all indexes are converted to strings! And if the programmers want to use 0-based, 1-based, 2-based or 42-based indexes they can do whatever they want! They can even have non-contiguous arrays or mix arrays with associative dictionaries. Total freedom.

    This thought process is completely imaginary. I suspect not a lot of thought was given to this type of flexibility at all back in the day. It was just a widely accepted assumption that scripting languages should be flexible, free and relaxed. And merging arrays and dictionaries (or "all arrays are associative arrays" as it was often called back then) was just one more idea that naturally fit into the zeitgeist, like type coercion and "assignment to an unknown variable sets a global variable by default".

    So tldr: it was a fad.

    • > Tcl

      Not to my understanding. What is called "array" in the language is indeed associative, but lists (heterogeneous vectors, like Python's) have always been here and distinct.

      But otherwise, insightful post. Awk is also the first thing that came to mind, with its funky arrays and "undefined variables are initialized on first use to the empty string or 0 depending on the context".

    • This fad was also helped, I think, by the memory architectures of the day. Without modern multilevel caches and ridiculously expensive cache misses, it might have seemed reasonable to converge on a one-size-fits-all structure. The cost of jumping around a relatively compact structure like a hash table probably seemed immaterial at the time.