Comment by jamesdutc

3 days ago

This post contains a key misconception about the Python builtin data structures, that may seem like sophistry but is key to understanding the semantics (and, thus, most fluent use) of these tools.

All of the Python builtin data structures are ordered.

The distinction we should make is not between ordered and unordered data structures. Instead, we should distinguish between human ordered and machine ordered data structures.

In the former, the data structure maintains an ordering that a human being can used as part of their understanding of the programme. A `list` is human-ordered (and its order typically connotes “processing” order,) a `tuple` is human-ordered (and its order typically connotes “semantic” ordering, which is why `sorted(…)` and `reversed(…)` is rarely a meaningful operation,) a `str` is human-ordered, and `int` is ordered (if we consider `int` in Python to be a container type, despite our inability to easily iterate over its contents. Whether or not `complex` is a container or not, is pushing this idea a bit too far, in part because I don't think anyone really uses `complex`, since NumPy dtype='complex128' is likely to be far more useful in circumstances where we're working within .)

In the latter, the data structure maintains an ordering that a human being cannot use as part of their understanding of a programme (usually as a consequence of a mechanism that the machine uses as part of its execution of the programme.) A `set` is machine-ordered, not unordered. If we iterate over a `set` multiple times in a row, we see the same ordering (even though we cannot predict this ordering.) In fact, the ordering of a `set` is intentionally made difficult for a human being to predict or use, by means of hash “salting”/seeding (that can only be controlled externally via, e.g., the https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHON... `PYTHONHASHSEED` environment variable.)

Historically, the Python `dict` was machine ordered. If we looped over a `dict` multiple times in a row (without changes made in between,) we were guaranteed a consistent ordering. In fact, for `dict`, the guarantee of consistency in this ordering was actually useful: we were guaranteed that `[d]` and `[d.values()]` on a `dict` (with no intervening changes) would maintain the same correspondence order (thus `[zip(d, d.values())]` would match exactly to `[d.items()]`!)

When the split-table optimisation was added to Python, the Python `dict` became a very interesting structure. Note that, from a semantic perspective, there are actually two distinct uses of `dict` that we see in use: as a “structural” or as “data” entity. (Ordering is largely meaningless for the former, so we'll ignore it for this discussion.) When the split-table optimisation was added in Python, the underlying storage for the `dict` became two separate C-level blocks of contiguous memory, one of which was machine-ordered (in hash-subject-to-seeding-and-probing/perturbation order) and one of which was human-ordered (in insertion order.) (From this perspective, we could argue that a `dict` is both human and machine-ordered, though it stands to reason that the only useful artefact we see of the latter is with `__eq__` behaviour, which this article discusses. Since “human ordering” is a guarantee, it supersedes “machine ordering.”)

0 comments

jamesdutc

No comments yet

Contribute on Hacker News ↗