
Comment by ztorkelson

8 years ago

I've found myself reaching for radix trees more and more for the same reason you listed: competitive and predictable performance.

That said, I can see two counter-arguments:

1. Rehashing can be avoided if the table size is known in advance.

2. Rehashing is immaterial if the data structure is write-once, read-many.

In those cases, we might reasonably expect a hash table to perform better.

If the use case is write-once, read-many, you can simply build a perfect hash table. Using techniques like CHD, the cost is IME usually in the range of 1x (i.e. no extra cost) to 3x, depending on how compact you want the table, how large it is, and how much time you want to spend up front.

Examples from my own small library (http://25thandclement.com/~william/projects/phf.html), using 1 million random integers:

  80% load -> 0.47s
  85% load -> 0.56s
  90% load -> 0.74s
  95% load -> 1.10s
  98% load -> 1.35s
  99% load -> 1.68s

If I hash the items as C++ string objects rather than uint32 scalars, the time doubles, presumably because of the dynamic allocation. Which goes to show that the "perfect" in a perfect hash table doesn't have to be the dominant cost.
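To make the write-once, read-many idea concrete, here's a minimal sketch in C. It is not CHD -- it just brute-forces a single seed until every key lands in its own slot, which only scales to small key sets (CHD gets its scalability by doing the search per bucket with displacement values) -- but the shape of the tradeoff is the same: pay once at build time, then every lookup is one hash and one comparison.

  /* Toy "perfect hash" build: brute-force a seed so that every key in a
   * fixed set maps to a distinct slot.  Not CHD (which buckets the keys
   * and searches per-bucket displacements); only an illustration of the
   * build-once / look-up-forever tradeoff. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static uint32_t mix(uint32_t key, uint32_t seed)
  {
      uint32_t h = key ^ seed;          /* any decent integer mixer works */
      h ^= h >> 16; h *= 0x7feb352dU;
      h ^= h >> 15; h *= 0x846ca68bU;
      return h ^ (h >> 16);
  }

  /* Search for a seed that maps all n keys to distinct slots in a table
   * of size m.  m controls the load factor (e.g. m = n / 0.85). */
  static int find_seed(const uint32_t *keys, size_t n, size_t m, uint32_t *out)
  {
      unsigned char *used = malloc(m);
      if (!used)
          return -1;
      for (uint32_t seed = 1; seed != 0; seed++) {
          size_t i;
          memset(used, 0, m);
          for (i = 0; i < n; i++) {
              size_t slot = mix(keys[i], seed) % m;
              if (used[slot])
                  break;                /* collision: try the next seed */
              used[slot] = 1;
          }
          if (i == n) {                 /* every key got its own slot */
              *out = seed;
              free(used);
              return 0;
          }
      }
      free(used);
      return -1;
  }

  int main(void)
  {
      uint32_t keys[] = { 3, 17, 42, 99, 256, 1024, 4096, 65537 };
      size_t n = sizeof keys / sizeof keys[0];
      size_t m = n + n / 4;             /* roughly 80% load */
      uint32_t seed;

      if (find_seed(keys, n, m, &seed) == 0)
          printf("seed %u is collision-free for table size %zu\n",
                 (unsigned)seed, m);
      return 0;
  }

With the seed in hand, a lookup is just mix(key, seed) % m against a fixed array -- no probing, no resizing, no tombstones.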

That said, 90% of the time I just use a simple red-black tree--a so-called intrusive implementation. It makes memory management much easier in C, and rare is the situation where a hash table (let alone a perfect hash table) can make a real difference relative to focusing on all the other inefficiencies.
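For anyone who hasn't used the intrusive pattern: the tree linkage is embedded in your own struct, so each element is a single allocation and the tree itself never allocates behind your back. A minimal sketch using the BSD <sys/tree.h> macros (shipped with the BSDs, available on Linux via libbsd); the struct and field names here are just placeholders:

  /* Intrusive red-black tree via BSD <sys/tree.h>: the RB_ENTRY linkage
   * lives inside the caller's struct, so there is exactly one allocation
   * per element and teardown is just unlinking and freeing elements. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/tree.h>

  struct item {
      int key;
      RB_ENTRY(item) link;     /* the intrusive part: tree node inside item */
  };

  static int item_cmp(struct item *a, struct item *b)
  {
      return (a->key > b->key) - (a->key < b->key);
  }

  RB_HEAD(item_tree, item);
  RB_PROTOTYPE(item_tree, item, link, item_cmp)
  RB_GENERATE(item_tree, item, link, item_cmp)

  int main(void)
  {
      struct item_tree tree = RB_INITIALIZER(&tree);
      int keys[] = { 5, 1, 9, 3 };

      for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++) {
          struct item *it = malloc(sizeof *it);
          it->key = keys[i];
          RB_INSERT(item_tree, &tree, it);      /* no allocation inside */
      }

      struct item query = { .key = 9 };
      struct item *found = RB_FIND(item_tree, &tree, &query);
      if (found)
          printf("found %d\n", found->key);

      /* teardown: pop the minimum until the tree is empty, freeing as we go */
      struct item *it;
      while ((it = RB_MIN(item_tree, &tree)) != NULL) {
          RB_REMOVE(item_tree, &tree, it);
          free(it);
      }
      return 0;
  }

Because insert and remove never allocate, there's no ENOMEM to handle mid-operation, which is a big part of why the pattern simplifies memory management in C.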

> 1. Rehashing can be avoided if the table size is known in advance

With an open-addressing implementation (Google's dense_hash_map, for example) this is only true if you rarely or never need to remove items from the map: erased slots can't simply be marked empty without breaking probe chains, so they become tombstones (dense_hash_map's "deleted key") that keep degrading lookups until the table is rehashed anyway.
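A toy linear-probing table makes the mechanism visible (hypothetical code, not dense_hash_map's implementation):

  /* Toy open-addressing table with linear probing: erase can't just mark
   * a slot EMPTY (that would cut the probe chain and strand keys inserted
   * past it), so it leaves a TOMBSTONE, and tombstones keep costing probe
   * steps until a rehash cleans them out. */
  #include <stdint.h>
  #include <stdio.h>

  #define CAP 8

  enum state { EMPTY, FULL, TOMBSTONE };
  struct slot { enum state state; uint32_t key; };

  static struct slot slots[CAP];

  static size_t home(uint32_t key) { return (key * 2654435761u) % CAP; }

  /* Return the index holding key, or (size_t)-1 if absent. */
  static size_t find(uint32_t key)
  {
      size_t i = home(key);
      for (size_t n = 0; n < CAP; n++, i = (i + 1) % CAP) {
          if (slots[i].state == EMPTY)
              return (size_t)-1;              /* chain ends: true miss */
          if (slots[i].state == FULL && slots[i].key == key)
              return i;
          /* occupied by another key or a TOMBSTONE: keep probing */
      }
      return (size_t)-1;
  }

  static void insert(uint32_t key)
  {
      size_t i = home(key);
      while (slots[i].state == FULL)          /* assumes the table isn't full */
          i = (i + 1) % CAP;
      slots[i] = (struct slot){ FULL, key };
  }

  static void erase(uint32_t key)
  {
      size_t i = find(key);
      if (i != (size_t)-1)
          slots[i].state = TOMBSTONE;         /* NOT EMPTY: keys probed past
                                                 this slot must stay reachable */
  }

  int main(void)
  {
      insert(1); insert(9); insert(17);       /* all three land in one probe chain */
      erase(9);
      printf("17 %s reachable after erasing 9\n",
             find(17) != (size_t)-1 ? "is still" : "is NOT");
      return 0;
  }

Once tombstones pile up, even lookups for absent keys walk long probe chains, so the table ends up rehashing (or compacting) no matter how carefully it was pre-sized.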