Are Python Dictionaries Ordered Data Structures?

2 days ago (thepythoncodingstack.com)

I mean this gets to the fundamental question of what being ordered actually means. The author decides that regardless of whether the data structure has an order, even if that order is guaranteed, it's a necessary property that order must be taken into account to determine equality. A very sensible and pure definition that says that equivalence classes should be single elements.

But this is a weird thing to actually do in Python, it's not a common thing to test dictionaries for equality like this. If you parsed some file into a dictionary and you want to preserve the order it was originally in when you write it back out then the standard dict works just fine. There's a tension in the article that if you rely on ordering you ought to consider using OrderedDict instead but that leads you to never rely on dict() ordering which is fine— but it's guaranteed to you. You're supposed to use it! The standard dict() even has a __reversed__ method so ordering is a meaningful property.

  • I think it was a mistake to standardize that dict will maintain insertion order. It's a nice side effect of the internal implementation, but I fear it will hinder future improvements.

    • The problem is that once this side effect is in the main version for a while, it becomes a feature whether you admit it or not. Since that is going to happen anyway, you might as well make it official.

      The alternative approach is to do what Go did and explicitly randomize iteration order to prevent people from relying on a fixed order in the first place.

      1 reply →

    • This is a genuine concern, since it hinders our ability to port over high-quality, high-performance hash table implementations from other languages (since these often do not preserve any human ordering.)

      However, the ship has already sailed here. I think that once insertion-ordering became the standard, this creates a guarantee that we can't easily back down from.

      4 replies →

    • I don't believe it was because some side effect became documented.

      It was because they wanted the order of the names in a class definition to match the order the were declared. The same dict implementation is used everywhere, so the standard dict acquired the behaviour.

This post contains a key misconception about the Python builtin data structures, that may seem like sophistry but is key to understanding the semantics (and, thus, most fluent use) of these tools.

All of the Python builtin data structures are ordered.

The distinction we should make is not between ordered and unordered data structures. Instead, we should distinguish between human ordered and machine ordered data structures.

In the former, the data structure maintains an ordering that a human being can used as part of their understanding of the programme. A `list` is human-ordered (and its order typically connotes “processing” order,) a `tuple` is human-ordered (and its order typically connotes “semantic” ordering, which is why `sorted(…)` and `reversed(…)` is rarely a meaningful operation,) a `str` is human-ordered, and `int` is ordered (if we consider `int` in Python to be a container type, despite our inability to easily iterate over its contents. Whether or not `complex` is a container or not, is pushing this idea a bit too far, in part because I don't think anyone really uses `complex`, since NumPy dtype='complex128' is likely to be far more useful in circumstances where we're working within .)

In the latter, the data structure maintains an ordering that a human being cannot use as part of their understanding of a programme (usually as a consequence of a mechanism that the machine uses as part of its execution of the programme.) A `set` is machine-ordered, not unordered. If we iterate over a `set` multiple times in a row, we see the same ordering (even though we cannot predict this ordering.) In fact, the ordering of a `set` is intentionally made difficult for a human being to predict or use, by means of hash “salting”/seeding (that can only be controlled externally via, e.g., the https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHON... `PYTHONHASHSEED` environment variable.)

Historically, the Python `dict` was machine ordered. If we looped over a `dict` multiple times in a row (without changes made in between,) we were guaranteed a consistent ordering. In fact, for `dict`, the guarantee of consistency in this ordering was actually useful: we were guaranteed that `[d]` and `[d.values()]` on a `dict` (with no intervening changes) would maintain the same correspondence order (thus `[zip(d, d.values())]` would match exactly to `[d.items()]`!)

When the split-table optimisation was added to Python, the Python `dict` became a very interesting structure. Note that, from a semantic perspective, there are actually two distinct uses of `dict` that we see in use: as a “structural” or as “data” entity. (Ordering is largely meaningless for the former, so we'll ignore it for this discussion.) When the split-table optimisation was added in Python, the underlying storage for the `dict` became two separate C-level blocks of contiguous memory, one of which was machine-ordered (in hash-subject-to-seeding-and-probing/perturbation order) and one of which was human-ordered (in insertion order.) (From this perspective, we could argue that a `dict` is both human and machine-ordered, though it stands to reason that the only useful artefact we see of the latter is with `__eq__` behaviour, which this article discusses. Since “human ordering” is a guarantee, it supersedes “machine ordering.”)