Comment by antirez
8 years ago
Does this implementation stop on rehashing? We should stop seriously considering hash tables that stop for rehashing, because the latency means they can be used only for a subset of problems. This is one of the main reasons I'm now in love with radix trees and trying to get rid of hash tables at every level in my code.
It's a bit funny to write off a hash table that is by design competing with another hash table by saying "it's not a different data structure, so I'm not interested."
Moreover, this kind of optimization only makes sense if (1) latency, not throughput, is your bottleneck or (2) you're really concerned about parallel accesses.
A reasonable default in general is to use a hash table + mutex until that is your bottleneck. (I've yet to encounter a scenario where switching from a hash map would provide significant gains in performance-relevant code)
Disclosure: I work at Google and use the fancy hash map discussed by Matt. I didn't work on it, though.
That's not a very charitable interpretation of the GP. He's not criticizing hash tables because they aren't radix trees. He's criticizing this and other hash table implementations because they have undesirable performance characteristics with respect to his needs, and noting an alternative data structure which he has found to be more suitable. I think that's fair game.
Note that there exist hash table implementations which support incremental resizing.
I do agree, though, that it's a leap to go from "hash tables with this property are inappropriate under these constraints" to "hash tables with this property should not be considered further", precisely because there is such a broad spectrum of data structure requirements across the industry.
> Note that there exist hash table implementations which support incremental resizing.
Rather, there exist hash table algorithms that support incremental resizing.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/cours...
Additionally, a hash table could be implemented in such a way that it could be updated during resizing. As long as you set up locks in the right places, you could spend the extra memory to do an out-of-place rehash and allow both concurrent writers and your rehashing thread to write to the new table. It's doable, if difficult and error-prone.
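To make the incremental-resizing idea concrete, here is a minimal sketch, roughly in the spirit of what Redis does for its dicts: keep two bucket arrays and migrate a bounded number of buckets per operation, so no single insert pays for a full rehash. All names are illustrative; duplicate keys, shrinking, and concurrency are ignored for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

template <typename K, typename V>
class IncrementalHashMap {
    using Bucket = std::list<std::pair<K, V>>;

    std::vector<Bucket> old_;        // current table
    std::vector<Bucket> new_;        // non-empty only while a rehash is in progress
    std::size_t size_ = 0;
    std::size_t migrate_pos_ = 0;    // next bucket of old_ to migrate

    static std::size_t index(const K& k, std::size_t buckets) {
        return std::hash<K>{}(k) % buckets;
    }

    bool rehashing() const { return !new_.empty(); }

    // Move at most `budget` buckets from old_ into new_: bounded work per call.
    void step(std::size_t budget = 1) {
        while (rehashing() && budget-- > 0) {
            if (migrate_pos_ == old_.size()) {   // migration finished
                old_.swap(new_);
                new_.clear();
                migrate_pos_ = 0;
                return;
            }
            for (auto& kv : old_[migrate_pos_])
                new_[index(kv.first, new_.size())].push_back(std::move(kv));
            old_[migrate_pos_].clear();
            ++migrate_pos_;
        }
    }

public:
    IncrementalHashMap() : old_(8) {}

    void insert(const K& k, V v) {
        step();                                        // a little migration work per call
        if (!rehashing() && size_ + 1 > old_.size())   // load factor ~1: start rehashing
            new_.assign(old_.size() * 2, Bucket{});
        auto& table = rehashing() ? new_ : old_;
        table[index(k, table.size())].emplace_back(k, std::move(v));
        ++size_;
    }

    V* find(const K& k) {
        step();
        if (rehashing())                               // key may not have migrated yet,
            for (auto& kv : new_[index(k, new_.size())])   // so check both tables
                if (kv.first == k) return &kv.second;
        for (auto& kv : old_[index(k, old_.size())])
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
};
```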
Radix trees are a very nice data structure. There is one drawback I never see in comparisons: they may require a non-constant amount of space for an insertion operation, i.e., they may introduce more nodes than the one being inserted, because nodes may need to be split. This means you cannot push the space allocation onto the caller of 'insert' (and the deallocation onto the caller of 'remove'), e.g. by (statically) preallocating a node cell in each item that can potentially be part of a radix tree.
This is important in environments where you don't have, or don't want to use, malloc/free, e.g. in embedded systems, which is exactly where properties like worst-case time behaviour matter and where radix trees therefore look like an option: they still might not be.
Well, but hash tables are not an option either in these situations, so good old binary search trees win (red-black, avl, 2-4, whatever).
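As a concrete illustration of the "preallocate the node cell in each item" pattern described above, here is a small sketch using Boost.Intrusive's red-black-tree based set (one possible off-the-shelf choice, not the only one): the tree hook lives inside the element itself, so insert and erase never allocate, which is what makes the worst case predictable in malloc-free environments.

```cpp
#include <boost/intrusive/set.hpp>
#include <cassert>

// The hook (parent/left/right/color) is embedded in the item itself.
struct Item : public boost::intrusive::set_base_hook<> {
    int key;
    explicit Item(int k) : key(k) {}
    friend bool operator<(const Item& a, const Item& b) { return a.key < b.key; }
};

int main() {
    // Storage is owned by the caller (here, automatic storage); the tree
    // only links the embedded hooks together.
    Item items[3] = {Item{2}, Item{1}, Item{3}};

    boost::intrusive::set<Item> tree;
    for (Item& it : items)
        tree.insert(it);           // O(log n), no heap allocation

    assert(tree.find(Item{2}) != tree.end());

    tree.clear();                  // unlink everything before the items go away
}
```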
I've found myself reaching for radix trees more and more for the same reason you listed: competitive and predictable performance.
That said, I can see two counter-arguments:
1. Rehashing can be avoided if the table size is known in advance.
2. Rehashing is immaterial if the data structure is write-once, read-many.
In those cases, we might reasonably expect a hash table to perform better.
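A small illustration of counter-argument 1: if the final size is known up front, reserving capacity means no rehash ever happens on the hot path. This uses std::unordered_map for familiarity; the same idea applies to other hash maps that expose a reserve/resize call.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, int> build_index(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, int> index;
    index.reserve(keys.size());            // allocate enough buckets once, up front
    for (std::size_t i = 0; i < keys.size(); ++i)
        index.emplace(keys[i], static_cast<int>(i));
    return index;                          // none of the inserts above triggered a rehash
}
```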
If the use case is write-once, read-many, you can simply build a perfect hash table. Using techniques like CHD the cost is IME usually in the range of 1x (no cost) to 3x, depending on how compact you want the table, how large it is, and how much time you want to spend up front.
Examples from my own small library (http://25thandclement.com/~william/projects/phf.html), using 1 million random integers:
If I hash the items as C++ string objects rather than uint32 scalars then the time is doubled, presumably because of the dynamic allocation. Which goes to show the "perfect" in perfect hash table doesn't need to be the dominant cost.
That said, 90% of the time I just use a simple red-black tree--a so-called intrusive implementation. It makes memory management much easier in C, and rare is the situation where a hash table (let alone a perfect hash table) can make a real difference relative to focusing on all the other inefficiencies.
> 1. Rehashing can be avoided if the table size is known in advance
With an open addressing implementation (Google's dense_hash_map, for example) this is only true if you rarely or never need to remove items from the map.
It does rehash when full. Presumably it would rehash less often than most because it remains stable at very high load factor.
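A hedged sketch of the point above, using Google's dense_hash_map from the sparsehash library: erase() leaves a tombstone marked with the deleted key, and tombstones are only reclaimed by a later rehash, so delete-heavy workloads can still rehash even when the live size never grows.

```cpp
#include <sparsehash/dense_hash_map>

int main() {
    google::dense_hash_map<int, int> m;
    m.set_empty_key(-1);     // sentinel for never-used slots (required by the API)
    m.set_deleted_key(-2);   // sentinel for tombstoned slots (required before erase)

    for (int i = 0; i < 1000; ++i) m[i] = i;
    for (int i = 0; i < 1000; ++i) m.erase(i);   // slots become tombstones, not empty

    // The table still carries those tombstones; inserting new keys will
    // eventually trigger a rehash to clean them out.
    m[42] = 1;
}
```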
https://en.wikipedia.org/wiki/Radix_tree looks like a funky structure for _very_ specific workloads.
Are radix trees always better? I doubt it.
Not always, but take a look at this recent comparison between modern radix trees and hash maps [1].
[1] https://bigdata.uni-saarland.de/publications/ARCD15.pdf
Not always better, but for most cases where people typically use a hash table it frequently is, assuming the radix tree design and implementation is excellent. There are still a few cases where well-designed hash tables are a better choice e.g. fixed size cache maps with unique keys. But to agree with the OP, there are very few hash tables in my software these days.
The downside of radix trees is the space of algorithm designs is pretty exotic, especially high-performance implementations. It is much easier to write a good hash table implementation than a good radix tree implementation.
Depending on implementation and usage, radix trees can get extremely segmented, slowing down significantly over time.
Would you mind elaborating? I'm not quite sure what you mean by segmented here.
Try these string keys:
(also great for breaking string sorts)