Comment by antirez
8 years ago
Does this implementation stop on rehashing? We should stop seriously considering hash tables that stop for rehashing, because the latency means they can be used only for a subset of problems. This is one of the main reasons I'm now in love with radix trees and trying to get rid of hash tables at every level in my code.
It's a bit funny to write off a hash table that is by design competing with another hash table by saying "it's not a different data structure, so I'm not interested."
Moreover, this kind of optimization only makes sense if (1) latency, not throughput, is your bottleneck or (2) you're really concerned about parallel accesses.
A reasonable default in general is to use a hash table + mutex until that is your bottleneck. (I've yet to encounter a scenario where switching from a hash map would provide significant gains in performance-relevant code)
Disclosure: I work at Google and use the fancy hash map discussed by Matt. I didn't work on it, though.
That's not a very charitable interpretation of the GP. He's not criticizing hash tables because they aren't radix trees. He's criticizing this and other hash table implementations because they have undesirable performance characteristics with respect to his needs, and noting an alternative data structure which he has found to be more suitable. I think that's fair game.
Note that there exist hash table implementations which support incremental resizing.
I do agree, though, that it's a leap to go from "hash tables with this property are inappropriate under these constraints" to "hash tables with this property should not be considered further", precisely because there is such a broad spectrum of data structure requirements across the industry.
> Note that there exist hash table implementations which support incremental resizing.
Rather, there exist hash table algorithms that support incremental resizing.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/cours...
Additionally, a hash table could be implemented in such a way that it could be updated during resizing. As long as you set up locks in the right places, you could spend the extra memory to do an out-of-place rehash and allow both concurrent writers and your rehashing thread to write to the new table. It's doable, if difficult and error-prone.
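To make the incremental-resizing idea concrete, here is a minimal sketch, roughly in the spirit of what Redis does for its dicts: keep two bucket arrays and migrate a bounded number of buckets per operation, so no single insert pays for a full rehash. All names are illustrative; duplicate keys, shrinking, and concurrency are ignored for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

template <typename K, typename V>
class IncrementalHashMap {
    using Bucket = std::list<std::pair<K, V>>;

    std::vector<Bucket> old_;        // current table
    std::vector<Bucket> new_;        // non-empty only while a rehash is in progress
    std::size_t size_ = 0;
    std::size_t migrate_pos_ = 0;    // next bucket of old_ to migrate

    static std::size_t index(const K& k, std::size_t buckets) {
        return std::hash<K>{}(k) % buckets;
    }

    bool rehashing() const { return !new_.empty(); }

    // Move at most `budget` buckets from old_ into new_: bounded work per call.
    void step(std::size_t budget = 1) {
        while (rehashing() && budget-- > 0) {
            if (migrate_pos_ == old_.size()) {   // migration finished
                old_.swap(new_);
                new_.clear();
                migrate_pos_ = 0;
                return;
            }
            for (auto& kv : old_[migrate_pos_])
                new_[index(kv.first, new_.size())].push_back(std::move(kv));
            old_[migrate_pos_].clear();
            ++migrate_pos_;
        }
    }

public:
    IncrementalHashMap() : old_(8) {}

    void insert(const K& k, V v) {
        step();                                        // a little migration work per call
        if (!rehashing() && size_ + 1 > old_.size())   // load factor ~1: start rehashing
            new_.assign(old_.size() * 2, Bucket{});
        auto& table = rehashing() ? new_ : old_;
        table[index(k, table.size())].emplace_back(k, std::move(v));
        ++size_;
    }

    V* find(const K& k) {
        step();
        if (rehashing())                               // key may not have migrated yet,
            for (auto& kv : new_[index(k, new_.size())])   // so check both tables
                if (kv.first == k) return &kv.second;
        for (auto& kv : old_[index(k, old_.size())])
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
};
```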
Radix trees are a very nice data structure. There is one drawback I never see in comparisons: they may require a non-constant amount of space for an insertion operation, i.e., they may introduce more nodes than the one being inserted, because nodes may need to be split. This means you cannot push the space allocation onto the caller of 'insert' (and the deallocation onto the caller of 'remove'), e.g. by (statically) preallocating a node cell in each item that can potentially be part of a radix tree.
This is important in environments where you don't have, or don't want to use, malloc/free, e.g. in embedded systems, which is exactly where properties like worst-case time behaviour matter and where radix trees therefore look like an option: they still might not be.
Well, but hash tables are not an option either in these situations, so good old binary search trees win (red-black, avl, 2-4, whatever).
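As a concrete illustration of the "preallocate the node cell in each item" pattern described above, here is a small sketch using Boost.Intrusive's red-black-tree based set (one possible off-the-shelf choice, not the only one): the tree hook lives inside the element itself, so insert and erase never allocate, which is what makes the worst case predictable in malloc-free environments.

```cpp
#include <boost/intrusive/set.hpp>
#include <cassert>

// The hook (parent/left/right/color) is embedded in the item itself.
struct Item : public boost::intrusive::set_base_hook<> {
    int key;
    explicit Item(int k) : key(k) {}
    friend bool operator<(const Item& a, const Item& b) { return a.key < b.key; }
};

int main() {
    // Storage is owned by the caller (here, automatic storage); the tree
    // only links the embedded hooks together.
    Item items[3] = {Item{2}, Item{1}, Item{3}};

    boost::intrusive::set<Item> tree;
    for (Item& it : items)
        tree.insert(it);           // O(log n), no heap allocation

    assert(tree.find(Item{2}) != tree.end());

    tree.clear();                  // unlink everything before the items go away
}
```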
I've found myself reaching for radix trees more and more for the same reason you listed: competitive and predictable performance.
That said, I can see two counter-arguments:
1. Rehashing can be avoided if the table size is known in advance.
2. Rehashing is immaterial if the data structure is write-once, read-many.
In those cases, we might reasonably expect a hash table to perform better.
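A small illustration of counter-argument 1: if the final size is known up front, reserving capacity means no rehash ever happens on the hot path. This uses std::unordered_map for familiarity; the same idea applies to other hash maps that expose a reserve/resize call.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, int> build_index(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, int> index;
    index.reserve(keys.size());            // allocate enough buckets once, up front
    for (std::size_t i = 0; i < keys.size(); ++i)
        index.emplace(keys[i], static_cast<int>(i));
    return index;                          // none of the inserts above triggered a rehash
}
```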
If the use case is write-once, read-many, you can simply build a perfect hash table. Using techniques like CHD the cost is IME usually in the range of 1x (no cost) to 3x, depending on how compact you want the table, how large it is, and how much time you want to spend up front.
Examples from my own small library (http://25thandclement.com/~william/projects/phf.html), using 1 million random integers:
If I hash the items as C++ string objects rather than uint32 scalars then the time is doubled, presumably because of the dynamic allocation. Which goes to show the "perfect" in perfect hash table doesn't need to be the dominant cost.
That said, 90% of the time I just use a simple red-black tree--a so-called intrusive implementation. It makes memory management much easier in C, and rare is the situation where a hash table (let alone a perfect hash table) can make a real difference relative to focusing on all the other inefficiencies.
> 1. Rehashing can be avoided if the table size is known in advance
With an open addressing implementation (Google's dense_hash_map, for example) this is only true if you rarely or never need to remove items from the map.
It does rehash when full. Presumably it would rehash less often than most because it remains stable at very high load factor.
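A hedged sketch of the point above, using Google's dense_hash_map from the sparsehash library: erase() leaves a tombstone marked with the deleted key, and tombstones are only reclaimed by a later rehash, so delete-heavy workloads can still rehash even when the live size never grows.

```cpp
#include <sparsehash/dense_hash_map>

int main() {
    google::dense_hash_map<int, int> m;
    m.set_empty_key(-1);     // sentinel for never-used slots (required by the API)
    m.set_deleted_key(-2);   // sentinel for tombstoned slots (required before erase)

    for (int i = 0; i < 1000; ++i) m[i] = i;
    for (int i = 0; i < 1000; ++i) m.erase(i);   // slots become tombstones, not empty

    // The table still carries those tombstones; inserting new keys will
    // eventually trigger a rehash to clean them out.
    m[42] = 1;
}
```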
https://en.wikipedia.org/wiki/Radix_tree looks like a funky structure for _very_ specific workloads.
Are radix trees always better? I doubt it.
Not always, but take a look at this recent comparison between modern radix trees and hash maps [1].
[1] https://bigdata.uni-saarland.de/publications/ARCD15.pdf
Not always better, but for most cases where people typically use a hash table it frequently is, assuming the radix tree design and implementation is excellent. There are still a few cases where well-designed hash tables are a better choice e.g. fixed size cache maps with unique keys. But to agree with the OP, there are very few hash tables in my software these days.
The downside of radix trees is the space of algorithm designs is pretty exotic, especially high-performance implementations. It is much easier to write a good hash table implementation than a good radix tree implementation.
Depending on implementation and usage, radix trees can get extremely segmented, slowing down significantly over time.
Would you mind elaborating? I'm not quite sure what you mean by segmented here.
Try these string keys:
(also great for breaking string sorts)