A new fast hash table in response to Google’s new fast hash table

8 years ago (probablydance.com)

Does this implementation stop on rehashing? We should stop seriously considering hash tables that stop for rehashing, because the latency means they can be used only for a subset of problems. This is one of the main reasons I'm now in love with radix trees and trying to get rid of hash tables at every level in my code.

  • It's a bit funny to write off a hash table that is by design competing with another hash table by saying "it's not a different data structure, so I'm not interested."

    Moreover, this kind of optimization only makes sense if (1) latency, not throughput, is your bottleneck, or (2) you're really concerned about parallel accesses.

    A reasonable default in general is to use a hash table + mutex until that is your bottleneck. (I've yet to encounter a scenario where switching from a hash map would provide significant gains in performance-relevant code)

    Disclosure: I work at Google and use the fancy hash map discussed by Matt. I didn't work on it, though.

    • That's not a very charitable interpretation of the GP. He's not criticizing hash tables because they aren't radix trees. He's criticizing this and other hash table implementations because they have undesirable performance characteristics with respect to his needs, and noting an alternative data structure which he has found to be more suitable. I think that's fair game.

      Note that there exist hash table implementations which support incremental resizing.
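
      For concreteness, a minimal sketch of incremental resizing (hypothetical code, not any particular library's design): keep the old table around after growing and drain a couple of its buckets on every operation, so no single insert pays for a full rehash.

        #include <cstddef>
        #include <functional>
        #include <list>
        #include <vector>

        // Hypothetical incremental-rehashing set; duplicate keys are not deduplicated, for brevity.
        class IncrementalSet {
            using Table = std::vector<std::list<int>>;
            Table live_, old_;          // old_ is non-empty only while a migration is in progress
            std::size_t drain_ = 0;     // next bucket of old_ to migrate
            std::size_t size_ = 0;

            static std::size_t slot(const Table& t, int key) {
                return std::hash<int>{}(key) % t.size();
            }
            void migrate_step() {       // move at most two old buckets per call
                for (int i = 0; i < 2 && !old_.empty(); ++i) {
                    for (int k : old_[drain_]) live_[slot(live_, k)].push_back(k);
                    if (++drain_ == old_.size()) { old_.clear(); drain_ = 0; }
                }
            }

        public:
            IncrementalSet() : live_(8) {}

            void insert(int key) {
                migrate_step();
                if (old_.empty() && size_ >= live_.size()) {   // grow at load factor ~1
                    old_.swap(live_);
                    live_ = Table(old_.size() * 2);
                }
                live_[slot(live_, key)].push_back(key);
                ++size_;
            }
            bool contains(int key) {
                migrate_step();
                auto in = [&](const Table& t) {
                    if (t.empty()) return false;
                    for (int k : t[slot(t, key)]) if (k == key) return true;
                    return false;
                };
                return in(live_) || in(old_);
            }
        };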

      I do agree, though, that it's a leap to go from "hash tables with this property are inappropriate under these constraints" to "hash tables with this property should not be considered further", precisely because there is such a broad spectrum of data structure requirements across the industry.

    • Radix trees are a very nice data structure. There is one drawback I never see in comparisons: they may require a non-constant amount of space for an insertion operation, i.e., they may introduce more nodes than the one being inserted, because nodes may need to be split. This means you cannot push the space allocation onto the caller of 'insert' (and the deallocation onto the caller of 'remove'), e.g. by (statically) preallocating a node cell in each item that can potentially be part of a radix tree.

      This is important in environments where you do not have, or do not want to use, malloc/free, e.g. in embedded systems, exactly where properties like worst-case time behaviour are important and, therefore, radix trees seem like an option: they still might not be.

      Well, but hash tables are not an option either in these situations, so good old binary search trees win (red-black, avl, 2-4, whatever).

  • I've found myself reaching for radix trees more and more for the same reason you listed: competitive and predictable performance.

    That said, I can see two counter-arguments:

    1. Rehashing can be avoided if the table size is known in advance.

    2. Rehashing is immaterial if the data structure is write-once, read-many.

    In those cases, we might reasonably expect a hash table to perform better.
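
    As a concrete sketch of counter-argument 1 (using std::unordered_map as a stand-in; most hash map libraries expose something similar), pre-sizing means no insert ever triggers a rehash:

      #include <unordered_map>

      int main() {
          std::unordered_map<int, int> m;
          m.reserve(1000000);               // size the bucket array for the known element count up front
          for (int i = 0; i < 1000000; ++i)
              m.emplace(i, i * 2);          // no rehashing happens anywhere in this loop
          return 0;
      }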

    • If the use case is write-once, read-many, you can simply build a perfect hash table. Using techniques like CHD the cost is IME usually in the range of 1x (no cost) to 3x, depending on how compact you want the table, how large it is, and how much time you want to spend up front.

      Examples from my own small library (http://25thandclement.com/~william/projects/phf.html), using 1 million random integers:

        80% load -> 0.47s
        85% load -> 0.56s
        90% load -> 0.74s
        95% load -> 1.10s
        98% load -> 1.35s
        99% load -> 1.68s
      

      If I hash the items as C++ string objects rather than uint32 scalars then the time is doubled, presumably because of the dynamic allocation. Which goes to show that the "perfect" in perfect hash table doesn't need to be the dominant cost.

      That said, 90% of the time I just use a simple red-black tree--a so-called intrusive implementation. It makes memory management much easier in C, and rare is the situation where a hash table (let alone a perfect hash table) can make a real difference relative to focusing on all the other inefficiencies.

    • > 1. Rehashing can be avoided if the table size is known in advance

      With an open addressing implementation (Google's dense_hash_map, for example) this is only true if you rarely or never need to remove items from the map.
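
      A toy linear-probing sketch (hypothetical, not dense_hash_map's actual code) shows why: erase() can only leave a tombstone, lookups must probe past tombstones, so after enough removals the table needs a cleanup rehash even though it never grew.

        #include <cstddef>
        #include <cstdint>
        #include <functional>
        #include <vector>

        // Toy linear-probing set; assumes the table always keeps at least one kEmpty slot.
        struct ProbingSet {
            enum State : uint8_t { kEmpty, kFull, kDeleted };
            std::vector<State> state;
            std::vector<int>   key;

            explicit ProbingSet(size_t capacity) : state(capacity, kEmpty), key(capacity) {}

            // Returns the slot holding k, or the first truly empty slot on its probe path.
            size_t probe(int k) const {
                size_t i = std::hash<int>{}(k) % state.size();
                while (state[i] != kEmpty) {
                    if (state[i] == kFull && key[i] == k) return i;
                    i = (i + 1) % state.size();          // tombstones are walked over, never reused here
                }
                return i;
            }
            void insert(int k)         { size_t i = probe(k); state[i] = kFull; key[i] = k; }
            bool contains(int k) const { return state[probe(k)] == kFull; }
            void erase(int k) {
                size_t i = probe(k);
                if (state[i] == kFull) state[i] = kDeleted;  // can't mark kEmpty: that would cut probe chains
            }
        };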

  • Are radix trees always better? I doubt it.

    • Not always better, but it frequently is for the cases where people typically use a hash table, assuming the radix tree design and implementation are excellent. There are still a few cases where well-designed hash tables are a better choice, e.g. fixed-size cache maps with unique keys. But to agree with the OP, there are very few hash tables in my software these days.

      The downside of radix trees is that the space of algorithm designs is pretty exotic, especially for high-performance implementations. It is much easier to write a good hash table implementation than a good radix tree implementation.

    • Try these string keys:

          { i * "a" : i = 1 to n }
      

      (also great for breaking string sorts)
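
      Spelled out as a generator (a trivial sketch): the i-th key is i copies of 'a', so any hash that walks the whole string does O(i) work per key, O(n^2) in total, and the long shared prefixes are exactly what hurts tries and string sorts.

        #include <cstddef>
        #include <string>
        #include <vector>

        std::vector<std::string> degenerate_keys(std::size_t n) {
            std::vector<std::string> keys;
            keys.reserve(n);
            for (std::size_t i = 1; i <= n; ++i)
                keys.emplace_back(i, 'a');     // "a", "aa", "aaa", ...
            return keys;
        }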

I am essentially a hash table fetishist, and I'm a little bummed out that this doesn't go into more detail about how the damn thing works. How do you implement a chaining hash table that uses a single flat array?

  • Like a FAT format but with relative indices. It should be pretty trivial pointer arithmetic.

    Edit: bytell_hash_map.cpp:464 bytell_hash_map.cpp:58

    Apparently there's one extra layer of indirection via a lookup table to get the final offset.

    So basically the metadata byte is a bitfield: a status flag plus either a status code (not sure that's actually required) or an offset enum. The offset enum then indexes into a lookup table, and the result is the relative offset from the current index.
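
    A hypothetical decoding along those lines (the real bytell_hash_map bit assignments and table contents may differ):

      #include <cstddef>
      #include <cstdint>

      constexpr uint8_t kEmpty        = 0xFF;   // assumed sentinel: no element in this slot
      constexpr uint8_t kChainHeadBit = 0x80;   // assumed flag: slot is the head of its chain
      constexpr uint8_t kJumpMask     = 0x7F;   // low 7 bits: index into the jump-distance table

      // Assumed lookup table of relative offsets; the real table's values are the author's.
      constexpr size_t kJumpDistances[128] = {0, 1, 2, 3, 4, 5, 6, 7, 8 /* ... */};

      inline bool   is_empty(uint8_t meta)      { return meta == kEmpty; }
      inline bool   is_chain_head(uint8_t meta) { return (meta & kChainHeadBit) != 0; }
      inline size_t next_offset(uint8_t meta)   { return kJumpDistances[meta & kJumpMask]; } // 0 = end of chain

      // Lookup then walks the chain entirely inside the flat array:
      //   size_t i = hash % capacity;
      //   if (is_empty(meta[i]) || !is_chain_head(meta[i])) return not_found;
      //   while (keys[i] != key) {
      //       size_t jump = next_offset(meta[i]);
      //       if (jump == 0) return not_found;
      //       i = (i + jump) % capacity;      // relative offset from the current index
      //   }
      //   return i;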

  • You can look at the source in his github repository: https://github.com/skarupke/flat_hash_map

    • I looked through the code and o_O there is a lot of C++ boilerplate. The core part of it is not really documented well, and there is this mysterious jump bits array... if I had more time, I could figure it out. But that's the point of a writeup, explaining things with a diagram, etc.

    • That's not the same as having a high-level description of the concepts used. For the details I can still dive into the code later.

  • You do hash % bucketCount to find your bucket/slot. Each array element is also a linked-list node (with a next pointer), so if the slot is already populated you follow the linked list to its end, place the new entry in a free slot of the array, and attach it to the end of the list.

    I figure it works best if you over-allocate slots, so that on average you only have to do very few hops on each linked-list walk (or no walk at all).

    Aside: This is the method employed by .NET's Dictionary<K,V> class.

    Further - it's generally faster to use structure of arrays rather than array of structures, thus in the above scheme the dictionary items/payloads and the linked list pointers could be in separate arrays.
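
    A minimal sketch of that scheme (structure of arrays, with next links stored as indices; hypothetical, and unlike .NET's Dictionary it has no free list, so removal is omitted):

      #include <cstdint>
      #include <functional>
      #include <string>
      #include <vector>

      struct FlatChainedMap {
          static constexpr int32_t kNil = -1;
          std::vector<int32_t> buckets;          // bucket -> index of first entry in its chain, or kNil
          std::vector<std::string> keys;         // structure of arrays: parallel vectors instead of
          std::vector<int> values;               // an array of node structs
          std::vector<int32_t> next;             // index of the next entry in the same chain, or kNil

          explicit FlatChainedMap(size_t bucket_count) : buckets(bucket_count, kNil) {}

          size_t bucket_of(const std::string& k) const {
              return std::hash<std::string>{}(k) % buckets.size();
          }
          void insert(const std::string& k, int v) {
              size_t b = bucket_of(k);
              keys.push_back(k);
              values.push_back(v);
              next.push_back(buckets[b]);        // prepend the new entry to its bucket's chain
              buckets[b] = static_cast<int32_t>(keys.size() - 1);
          }
          const int* find(const std::string& k) const {
              for (int32_t i = buckets[bucket_of(k)]; i != kNil; i = next[i])
                  if (keys[i] == k) return &values[i];   // few hops if slots are over-allocated
              return nullptr;
          }
      };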

Sort of off topic: In theory hash tables are supposed to be O(1) (under the RAM model). But looking at these graphs, they seem to grow linearly after they stop fitting in the cache which is especially apparent on the unsuccessful lookup graph. Since it's on a log-scale graph, O(log(n)) would probably fit better.

Just thought it's an interesting divergence between what's taught in school and in practice. I wonder if there's something that's more predictive than big Oh for these types of analysis.

  • The biggest divergence between what's taught in school and what happens in practice is that the Big O limits you saw in school use a different abstract machine relative to what we use today, in practice. Specifically, I'm talking about the assumption that random memory access is O(1), where in reality modern machines' cache hierarchy makes it behave more like O(log n) or O(sqrt n) or something to that effect.

    • Is that really true, though? It's certainly the case that a cache miss takes longer than a cache hit, but the time for a cache miss doesn't depend on how much memory you have. Or does it? Like, a cache miss isn't going to be slower if you have 64 gigabytes of memory compared to having 1 gigabyte of memory, it's just a cache miss. That makes random memory access O(1) in the sense of complexity analysis.

  • You should think of memory as fixed-length contiguous blocks of different sizes.

    Inside the processor there are cache lines (usually 64 bytes). They are blocks of memory that are tagged by the CPU at once and moved together.

    In the typical n-way set-associative cache architecture, main RAM is divided into blocks of n lines each. Each set in the cache can hold up to n lines from the same memory block. An 8-way cache would divide 1 GB of RAM into 1,024 blocks of 1 MB each. If you work with more than 512 bytes (= 8 × 64) at a time within one such block, there will be cache misses. In other words, CPU caches have a limited number of cache lines dedicated to any large contiguous block of RAM (unless they are fully associative caches).

    Between the CPU and DRAM there are typically 64-byte and 4096-byte regions with range-crossing penalties. I think the 64-byte crossing penalty is typically less than 10 cycles, and the 4096-byte region crossing penalty is several tens of cycles (this is on top of the penalty of accessing DRAM).

  • Interestingly, in Java 8 they changed HashMap buckets to hold a binary tree instead of a linked list, changing the worst case from O(n) to O(log n).

    https://stackoverflow.com/questions/24673567/change-to-hashm...

    • To be more accurate, binary trees are constructed only after a bucket ends up containing more than a small number of elements, which should only happen rarely unless a bad hash function is chosen.

    • That's defending against hash collisions which isn't what the grandparent is noticing (the slowing down of memory accesses due to not fitting into cache).

    • I think they implemented that change to deal with a denial of service attack in http servers. A malicious client could send a large list of parameters with an identical String::hashCode() value, resulting in a single large bucket. Other workarounds included a size limit on the parameter list or a hash function with some parameters randomized at program start.
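
      For illustration, a small C++ sketch of how such colliding parameter names can be generated: String.hashCode() is the fixed polynomial h = 31*h + c, and "Aa" and "BB" hash identically, so concatenating k such blocks gives 2^k distinct strings that all land in the same bucket.

        // "Aa" and "BB" collide under Java's String.hashCode():
        //   31*65 + 97 == 31*66 + 66 == 2112
        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        uint32_t java_string_hash(const std::string& s) {
            uint32_t h = 0;                           // unsigned wraparound mirrors Java's int overflow
            for (unsigned char c : s) h = 31 * h + c;
            return h;
        }

        int main() {
            std::vector<std::string> keys = {""};
            for (int block = 0; block < 16; ++block) {     // 2^16 = 65,536 colliding keys
                std::vector<std::string> grown;
                for (const std::string& k : keys) {
                    grown.push_back(k + "Aa");
                    grown.push_back(k + "BB");
                }
                keys.swap(grown);
            }
            std::cout << keys.size() << " keys, all with hash "
                      << java_string_hash(keys.front()) << "\n";
        }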

    • FYI this only happens if the objects implement Comparable; if they do not, it will not build binary trees.

  • Big O is about doing the big picture high level analysis without wasting time on every little detail like the complexity of addition and multiplication. High performance implementation on the other hand is all about twiddling the constant factors in your algorithm(s) for your specific hardware architecture/topology.

    See:

    http://kaldewey.com/pubs/FAST__SIGMOD10.pdf

    • The parent's claim implies that on a machine with an abstract cache, the complexity of hashmaps could be proven to be something else. You may be able to recover the right behavior by modeling a machine with infinite memory, divided up into an infinite tower of caches where each level is P times bigger and Q times slower.

    • You definitely have to take the time complexity of addition/multiplication into account when doing big O analysis. But in most cases they are O(1). It is the constant factors you don’t worry about.

  • > divergence between what's taught in school and in practice

    I remember this sort of thing being mentioned, but not in as much detail as I would have gone into (though I'd self-taught a lot of this before Uni, so for me it was more an exercise in unlearning mistakes and formalising & rounding off knowledge than it was about taking in new concepts; perhaps for others, cramming the extra detail in would have been counterproductive). Algorithm courses would say things like "of course caching methods should be taken into consideration, you'll cover this in architecture modules" and architecture modules would say "you'll cover the effect on some algorithms in much more detail in your software engineering modules"!

    All the JS sorting demos that were an active fad a short while ago were simplified in the same way: they assumed that either a comparison or a swap was always the most expensive operation and the other could be discounted, or that all swaps/comparisons were equal due to uniform memory speed (no accounting for caching), and no consideration was made for what was being sorted (inserts are much more expensive in an array than in a linked list, for instance). I keep meaning to find the time to put together a less superficial demo that allows such things to be modelled at a basic level, perhaps even trying to illustrate the effect of a more distributed architecture (where there is more than one processing unit involved and everything isn't in local fast memory).

    Of course, for the most part these simplifications are perfectly valid, but they should always carry a caveat noting what simplifying assumptions are being applied so people learning know that the issues could be more complicated so some critical thinking should be applied in any real world situation.

  • > Just thought it's an interesting divergence between what's taught in school and in practice.

    Not really, no. This isn't a matter of complexity theory being less useful than promised, it's just a matter of understanding what it tells you.

    In complexity theoretic terms, platform-specific micro-optimisations are constant factors. They can have a real impact, but they're only worth worrying about after you've got good complexity.

    Bubble-sort in fine-tuned SIMD assembly code will out-perform bubble-sort in Python, but will be far slower than a Python quicksort (except for small inputs).

    Notice that we're discussing which hash-map to use, as opposed to which data-structure to use in the first place. Their basic time-complexities are the same, so we have the luxury of worrying about the constant factors.

    Linear-scan dictionaries can't compete here, no matter how well micro-optimised. (Again, except for very small inputs.) Complexity theory tells us why.

    Additionally, complexity theory is useful when reasoning about large problems (high values of n), but it doesn't tell you how to do the platform-specific micro-optimisations that are important when attacking small problems. Implementation details will dominate there, rather than the number of algorithmic operations.

    Anyone who teaches complexity theory without explaining these points has done their students a disservice.

    > I wonder if there's something that's more predictive than big Oh for these types of analysis.

    When it comes to real-world performance, there's no substitute for just measuring real-world performance. Detailed analysis of things like cache behaviour, and behaviour under highly concurrent access, could be neat though.

    Edit: For completeness, I should mention that in extreme cases there are indeed algorithms with superior complexity which are in practice completely useless, as the break-even point is so awfully high that they could never be useful. Matrix multiplication, for instance, where the algorithms with the best time complexity are quite literally never useful. https://en.wikipedia.org/w/index.php?title=Computational_com...

    Perhaps that contradicts my whole point :-P

    The other limitation of complexity theory is that in practice we often care about average-case performance, whereas theorists tend to care more about worst-case performance.

    • The increasing latency [1] of access to larger amounts of storage isn't a constant factor, though, even asymptotically. There was a fun post on HN a while ago arguing that random access latency to N bytes is asymptotically O(N^0.5), from both empirical data about real systems and theoretically from the Bekenstein bound and the speed of light. And even if you don't buy that, it clearly can't be less than O(N^(1/3)). Don't go preaching this in job interviews, though :-)

      The truth is that complexity theory is always done relative to some abstract machine model, and as with all models it's up to you to determine if the model is a good fit for your practical reality. "To the extent that mathematics reflects reality it is not certain, and to the extent that it is certain it does not reflect reality." There exist abstract models, like the cache-oblivious model, that do a somewhat better job of grappling with this issue but don't require you to just throw up your hands and rely solely on benchmarks of real hardware. There is probably room for new theory in this area.

      [1] not throughput, which isn't as hard to scale up

  • The performance gap between cache-hot lookups and lookups that miss can be 10x. The increase is more a reflection of that than of doing more operations.

    The RAM model doesn't concern itself with memory hierarchies or slow ops (i.e., divisions are assumed to be as fast as additions).

    For example, if you were benchmarking cuckoo hashing lookups, the results would look similar even though the number of operations is deterministically O(1) rather than expected O(1).

  • It's not a log-scale graph for the things you talk about: the y-axes are linear scale. So for an O(1) algorithm, time has an upper bound and plateaus as caches stop being helpful and every operation has to go all the way out to RAM.

  • If we were talking about the worst lookups, then even in the RAM model they would become slower as the map grows.

    To be more specific, if you place n balls into n bins, on average each bin will contain 1 ball, but the largest one, with high probability, will contain O(log(n)/log(log(n))) of them.
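
    A quick simulation (a standalone sketch, not tied to any hash table) makes that concrete: throw n balls into n bins and compare the fullest bin against ln(n)/ln(ln(n)).

      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <random>
      #include <vector>

      int main() {
          for (size_t n : {1000u, 100000u, 10000000u}) {
              std::mt19937_64 rng(42);
              std::uniform_int_distribution<size_t> pick(0, n - 1);
              std::vector<int> bins(n, 0);
              for (size_t i = 0; i < n; ++i) ++bins[pick(rng)];   // n balls into n bins
              int max_load = *std::max_element(bins.begin(), bins.end());
              double estimate = std::log(double(n)) / std::log(std::log(double(n)));
              std::printf("n=%zu  fullest bin=%d  ln(n)/ln(ln(n))=%.1f\n", n, max_load, estimate);
          }
      }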

  • can any hash table whose hash function operates over the entire string ever show O(1) time?

    • The O(1) amortized access of a hash table is with respect to the number of entries, not key hashing or comparison. Doubling the number of entries does not double the retrieval time.

      As a hash table gets larger, the cost of key hashing (for the same domain of keys) does not increase. Of course hashing and comparison performance can still be important, it's just not what is generally being analyzed with basic complexity theory.

    • > can any hash table whose hash function operates over the entire string ever show O(1) time?

      If we limit the length of the hash key, the time taken by the hash function is also limited and becomes a constant. Big-O is all about asymptotic behavior.

      OTOH noticing how the hash function's run time depends on the size of the key can be separately useful. I suspect all such functions are linear in practice, though.
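
      For example, a typical string hash (FNV-1a shown here as a stand-in) is linear in the key's length and independent of how many entries the table holds, which is why it folds into the constant once key length is bounded:

        #include <cstdint>
        #include <string_view>

        uint64_t fnv1a(std::string_view key) {
            uint64_t h = 14695981039346656037ull;   // FNV-1a 64-bit offset basis
            for (unsigned char c : key) {
                h ^= c;
                h *= 1099511628211ull;              // FNV prime; one multiply per byte => O(key length)
            }
            return h;
        }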

  • One thing to think about is that accessing a byte of memory in a system with n bytes of memory is pretty much guaranteed to be at best a log(n) operation. After all, you need a log(n)-sized address to index the memory, and you need to read every single one of the bits in the address to index the memory.

    In practice on any particular machine all the addresses are the same length (and we have finite memory), so it doesn't quite explain the effect you are observing in the article. But it does hint that it makes sense.

I personally prefer Judy arrays: http://judy.sourceforge.net/

  • From http://judy.sourceforge.net/doc/10minutes.htm :

    > From a speed point of view Judy is chiefly a 256-ary digital tree or trie (per D. Knuth Volume 3 definitions). A degree of 256-ary is a somewhat "magic" N-ary for a variety of reasons -- but mostly because a byte (the least addressable memory unit) is 8 bits. Also a higher degree means reduced cache-line fills per access. You see the theme here -- avoid cache-line fills like the plague.

    Sounds like a neat data-structure made with cache behaviour in mind. I'm not getting much out of the creator's attempts to explain it, but there's a Wikipedia article: https://en.wikipedia.org/wiki/Judy_array

I wonder how it compares to Google's "Learned Index Structures" https://arxiv.org/abs/1712.01208

  • I don’t think there’s much to write home about RE: Learned Index Structures since classical structures can soundly outperform at least the trumpeted Google “success story” [1]. It’s hype, not substance.

    [1]: https://dawn.cs.stanford.edu/2018/01/11/index-baselines/

    • Completely disagree. It is NOT simply about the results but about the direction and the new approach.

      Ultimately it also comes down to the power required to get some task done.

      Also how can a paper submitted to NIPS be hype?

  • Great link. I will be curious to see if the Google (Jeff Dean) approach works in real-life situations.

    What is interesting is that you can use TPUs and parallelize the approach.

    Ultimately it is about using less power to get some work done.

Would be interesting to have comparisons with Rust's std hashmap and indexmap (formerly ordermap) as well.

Props, this is an amazing contribution. So much respect for someone who can really dig in, figure this stuff out, and make a better data-structure for nothing more than the love of doing so. Everyone making their living as a programmer/tech person owes it to people like this who make the basic tools.

  • His ska-sort (a really fast in-place radix sort generalisable to almost all data) is also very interesting, I'm surprised there aren't more people working on the data structures he builds to make little improvements.

    I may actually have figured out a way to (theoretically) make it faster, but I don't grok C++ templates so can't write a pull-request (the general idea is: instead of a standard full swap, use two temporaries and ping-pong between them while swapping elements to do half-swaps. This reduces required assignments to move an element to its final position from three to two).
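
    A rough sketch of that idea applied to the generic task of moving elements along a permutation cycle (hypothetical code, not a patch against ska_sort): a swap-based loop costs three assignments per placed element, while ping-ponging between two temporaries costs two (one load of the evicted element, one store of the held one).

      #include <cstdio>
      #include <vector>

      // dest[i] = index where the element currently at position i should end up.
      void apply_permutation_pingpong(std::vector<int>& v, std::vector<size_t> dest) {
          for (size_t start = 0; start < v.size(); ++start) {
              if (dest[start] == start) continue;   // fixed point, or cycle already handled
              int t0 = v[start], t1 = 0;
              bool held_in_t0 = true;               // which temporary holds the in-flight element
              size_t i = start;
              while (true) {
                  size_t next = dest[i];
                  dest[i] = i;                      // mark this slot as settled
                  if (next == start) {              // the cycle closes
                      v[start] = held_in_t0 ? t0 : t1;
                      break;
                  }
                  if (held_in_t0) { t1 = v[next]; v[next] = t0; }   // two assignments...
                  else            { t0 = v[next]; v[next] = t1; }   // ...per placed element
                  held_in_t0 = !held_in_t0;
                  i = next;
              }
          }
      }

      int main() {
          std::vector<int> v = {10, 20, 30, 40};
          apply_permutation_pingpong(v, {2, 0, 3, 1});   // 10->slot 2, 20->slot 0, 30->slot 3, 40->slot 1
          for (int x : v) std::printf("%d ", x);         // prints: 20 40 10 30
          std::printf("\n");
      }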

  • To be fair, the max load factor for this hash map is set to 0.5, which is quite wasteful in terms of memory. If you allow for 50% load in dense_hash_map (and most hash maps, for that matter), they become much faster themselves.

I like this work. We should be seeing more exploratory posts like this one.

I find it interesting that both Google's and skarupke@'s hash maps have exactly the same name: flat_hash_map. Maybe boost-inspired?

This post has some overconfident and under-researched claims:

> This came about because last year I wrote the fastest hash table (I still make that claim)

I'm extremely dubious of claims of the "fastest $DATA_STRUCTURE" without qualifiers about workload, hardware, and time/space trade-off. In particular, the last graph of this post shows some workloads (400k int keys, int[16] values, unsuccessful lookups) where the author's implementation of Swiss Table (as Google called their table) is 3x-10x slower than "bytell_hash_map", the author's own design.

> google_flat16_hash_map. This last one is my implementation of Google’s new hash table. Google hasn’t open-sourced their hash table yet, so I had to implement their hash table myself. I’m 95% sure that I got their performance right.

The talk announcing Swiss Table clearly didn't have enough detail to fully replicate their results. This 95% claim is rather astonishing in light of that.

  • Hmm I’m seeing the opposite. The last graph shows Google’s hash table outperforming in all but a few places, including at 400k. Perhaps you mixed the axes up, lower is better here.

    I also just want to comment that skepticism towards big claims is good, but your comment seems a bit too harsh. Briefly looking at his post where he goes into his “fastest hash table” claim, it's quite long, very detailed, and has similar graphs. That doesn't seem under-researched to me.

    • > Hmm I’m seeing the opposite. The last graph shows Google’s hash table outperforming in all but a few places, including at 400k.

      You're exactly right, General Pizza. I was trying to say that the Google table was faster, even though the author claimed that his or her table was "fastest" without qualification.

      Thanks for pointing that out; I'll leave my post as it is so yours continues to make sense in context.

  • meta: classic HN, the top post is a Negative Nancy regardless of how impressive the new thing is. Is there a name for this phenomenon yet?

    • I know the type of post you're talking about - dismissive, easy critiques like "correlation does not.." - but GP's remark is not quite like that:

      - it states the article has some overconfident claims.

      - it gives detailed arguments for why

      > Is there a name for this phenomena yet?

      "bachelors' wives and maidens' children are well taught"?

    • No, and there should not be. There is unjustified critique and there is justified critique.
