
Comment by jamesdutc

3 days ago

This is a genuine concern, since it hinders our ability to port over high-quality, high-performance hash table implementations from other languages (these often do not preserve any human ordering).

However, the ship has already sailed here. I think that once insertion ordering became the standard, it created a guarantee that we can't easily back down from.
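For reference, the guarantee in question is a minimal one to state (this sketch is illustrative, not from the thread): since CPython 3.7, iterating a `dict` yields keys in insertion order as a language-level promise.

```python
# Since Python 3.7, dict preserves insertion order as a language guarantee.
d = {}
for key in ["banana", "apple", "cherry"]:
    d[key] = len(key)

# Iteration yields keys in exactly the order they were inserted,
# not in hash order and not sorted.
assert list(d) == ["banana", "apple", "cherry"]
```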

The python dict() is already high-quality and high-performance. You're probably not going to be able to do much better than the current state. If you want faster, you end up moving your data into an environment that lets you make more assumptions and be more restrictive with your data, and operate on it there. It's how NumPy, Polars, and pandas work.

Everything in Python is a dict; there's no data structure that's been given more attention.

https://m.youtube.com/watch?v=p33CVV29OG8

  • > The python dict() is already high-quality and high-performance

    Yes, the CPython `PyDictObject` has been subject to a lot of optimisation work, and it is both high-quality and high-performance. I should not have implied that this is not the case.

    However, there's a lot of ongoing research into even further improving the performance of hash tables, and there are regular posts discussing the nature of these kinds of improvements: e.g., https://news.ycombinator.com/item?id=17176713

    I have colleagues who have wanted to improve the performance of their use of `dict` (within the parts of their code that are firmly within the structural/Python domain) and who have wanted to integrate these alternate implementations. For the most part, these implementations do not guarantee “human ordering,” so they can provide these tools only as supplements (and not replacements) of the Python built-in `dict`.
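    To illustrate why an unordered table can't be a drop-in replacement, here's a common idiom (a hypothetical example, not from the thread) that silently depends on insertion order:

```python
# Order-preserving de-duplication relies on dict insertion order.
# Swapping in an unordered hash table would break callers that
# expect first-seen order.
def dedupe(items):
    return list(dict.fromkeys(items))

assert dedupe(["b", "a", "b", "c"]) == ["b", "a", "c"]
```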

    > moving your data into an environment that lets you make more assumptions and be more restrictive with your data and operate on it there. It's how numpy, polaris, and pandas work.

    Yes, the idea of a heterogeneous topology of Python code, wherein “programme structuring” is done in pure Python and “computation” is done in aggregate types that lower to C/C++/Rust/Zig (thus eliminating dynamic dispatch, ensuring contiguity, &c.), is common in Python. As you note, this is the pattern that we see with NumPy, Polars, pandas, and other tools. We might put a name to this pattern: the idea of a “restricted computation domain.” (I believe I introduced this terminology into wide use within the Python community.)
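    A minimal sketch of the pattern, using the stdlib `array` module as a stand-in for NumPy/Polars (illustrative only; the real libraries do far more):

```python
from array import array

# "Programme structuring" in pure Python: heterogeneous, dynamic records.
records = [("a", 1.0), ("b", 2.0), ("c", 3.0)]

# Entering the restricted computation domain: a homogeneous, contiguous
# buffer of C doubles, with no per-element Python object boxes.
values = array("d", (v for _, v in records))

# The aggregate computation is driven from C by the builtin `sum`,
# rather than by a pure-Python loop over boxed objects.
total = sum(values)
assert total == 6.0
```

    The point of the pattern is the hand-off: generality is paid for once at the boundary, and the homogeneous bulk work happens under stricter assumptions.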

    However, not all code shows such a stark division between work that can be done at high generality (and, correspondingly, low performance) in pure Python and work that can be done at low generality (but very high performance) within a “restricted computation domain.” There are many types of problems wherein the line between these is blurred, and it is in this region where improvements to the performance of pure Python code may be desired.

    • > they can provide these tools only as supplements (and not replacements) of the Python built-in `dict`.

      Maybe they’ll walk that back if there is a compelling new implementation that isn’t ordered. Or they could keep `dict` as is and use the better implementation for internal dict-like structures.
