Comment by rurban

17 hours ago

Don't compare apples to oranges. unordered_map is so slow because it has to guarantee pointer stability, which forces separate chaining, whilst open-addressing hash tables, which probe and move entries, do not. They are at least 2x faster.

Compare against linear probing, quadratic probing, double hashing, cuckoo hashing, or swiss tables.
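
To make the distinction concrete, here is a minimal sketch of open addressing with linear probing. This is a toy, not any of the named implementations; the power-of-two sizing, 50% max load factor, and lack of deletion are simplifying assumptions:

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    // Toy open-addressing hash table with linear probing.
    // Illustrative only: assumes default-constructible K/V, no erase,
    // and none of the SIMD metadata tricks swiss tables add on top.
    template <typename K, typename V>
    class LinearProbingMap {
        struct Slot { K key; V value; bool used = false; };
        std::vector<Slot> slots_{16};   // power-of-two capacity
        std::size_t size_ = 0;

        std::size_t probe(const K& key) const {
            std::size_t i = std::hash<K>{}(key) & (slots_.size() - 1);
            while (slots_[i].used && !(slots_[i].key == key))
                i = (i + 1) & (slots_.size() - 1);  // linear probe: next slot
            return i;
        }

        void rehash() {
            std::vector<Slot> old = std::move(slots_);
            slots_.assign(old.size() * 2, Slot{});
            size_ = 0;
            for (Slot& s : old)
                if (s.used) insert(s.key, s.value);
        }

    public:
        void insert(const K& key, const V& value) {
            if (2 * (size_ + 1) > slots_.size()) rehash();  // keep load <= 50%
            Slot& s = slots_[probe(key)];
            if (!s.used) { s.key = key; s.used = true; ++size_; }
            s.value = value;  // NOTE: rehash moves entries, so pointers
                              // into this table are NOT stable.
        }

        std::optional<V> find(const K& key) const {
            const Slot& s = slots_[probe(key)];
            return s.used ? std::optional<V>(s.value) : std::nullopt;
        }
    };

The speed comes from the flat storage: probes walk contiguous memory instead of chasing per-node pointers, at the cost of the stability guarantees discussed below.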

This seems like it has causality upside down. The pointer stability and weird bucketing API of std::unordered_map happen to be consequences of its design, and you're stupidly allowed to rely on them, and so it can't be trivially replaced with a better hash table.
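
To make the guarantee concrete, here is a small self-contained example (the specific keys and values are just for illustration). Rehashing invalidates iterators, but the standard requires that pointers and references to elements stay valid:

    #include <cassert>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m;
        m[1] = "one";
        std::string* p = &m[1];  // pointer into the node for key 1

        // Force many rehashes; with separate chaining the node never moves.
        for (int i = 2; i < 100000; ++i) m[i] = "x";

        assert(p == &m[1]);   // still valid: pointers/references to
        assert(*p == "one");  // elements survive rehashing
    }

Any open-addressing replacement that moves entries on rehash would break code that relies on this, which is exactly why the table can't be swapped out underneath the existing API.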

But most people didn't want this. std::unordered_map is just a more modern example of WG21's ability to snatch failure from the jaws of success than, say, std::vector<bool>.

The approach taken in std::unordered_map was already a historical footnote when C++98 shipped, but what's crazy is that it didn't ship in C++98; it's from C++11. The ISO document specifying this archaic hash table design as the standard C++ hash table dates from the Obama presidency, as does Rust 1.0.

So while maybe it's unfair in one sense, in another sense it makes absolute sense to compare against std::unordered_map because this is what you get out of the box.

  • > So while maybe it's unfair in one sense, in another sense it makes absolute sense to compare against std::unordered_map because this is what you get out of the box.

    That's not really a reason to omit comparisons with other, better implementations.

  • > [pointer stability] you're stupidly allowed to rely on them

    That's a bit unfair. Those are properties that can be very useful if you need them.

    • Indeed. One of the main reasons the STL defaults to a non-moving hash table is that "deleting in the middle of an iteration" was viewed as an important property (see the sketch at the end of this comment). Like it or not (I do not), the entire container part of the original 1990s Stepanov STL was oriented around "iteration/iterators" as the high-priority abstraction.

      Trying to think like a DB engineer is probably helpful here. Point or range query workloads absolutely afford different optimizations/specializations than full table scan iterations.

      Any mistake here is more in thinking "one size fits all workloads". That may be as much a mistake of user communities as of language designers. (E.g. 'expert' advice like "just always use X" instead of "well, it depends...".) So, the charitable reading of tialaramex's "most people" is a weakly articulated assertion about workloads (though I absolutely agree with your specific point).
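
      The erase-while-iterating sketch promised above, for anyone unfamiliar with the idiom. Since C++11, erase(it) returns the next valid iterator, and erasing one element invalidates iterators only to that element:

          #include <string>
          #include <unordered_map>

          int main() {
              std::unordered_map<int, std::string> m{
                  {1, "a"}, {2, "b"}, {3, "c"}};

              // Delete entries in the middle of an iteration:
              // other elements are never moved, so the traversal
              // continues safely.
              for (auto it = m.begin(); it != m.end(); ) {
                  if (it->first % 2 == 0)
                      it = m.erase(it);
                  else
                      ++it;
              }
          }

      A table that relocates surviving entries on erase could not promise this without a more elaborate iteration protocol.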

      3 replies →