
Comment by svat

1 year ago

Finally finished reading the article (thanks!). The crux of it for me was:

- They had a "dictionary" of 30000 words, and accepting a ~1/4000 rate of false positives meant that if they hashed each word to a 27-bit string (integer), they could throw away the dictionary entirely; the problem reduced to storing a set of 30000 27-bit strings.

- Somewhat surprisingly, information theory tells us that 30000 27-bit strings can be stored using not 27 but just ~13.57 bits per word. I understand the math (it's straightforward: https://www.wolframalpha.com/input?i=log_2%282%5E27+choose+3... ) but it will probably take me a while to stop finding this counterintuitive, as 30000 is so small compared to 2^27 (which is ~134 million) that it is hard to see where the gains come from (there's a quick numerical check after this list).

- To encode this 30000-element subset of 27-bit hashes, they sorted the hashes and encoded the differences between consecutive ones, which turn out to be geometrically distributed, using a coding scheme tuned for geometrically distributed input (Golomb coding), to actually achieve ~13.6 bits per word (toy sketch below).
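
For anyone who wants to sanity-check those numbers, the arithmetic fits in a few lines of Python (my own back-of-the-envelope, not code from the article):

```python
from math import comb, e, log2

N = 2 ** 27   # number of possible 27-bit hash values (~134 million)
K = 30_000    # dictionary size

# A random non-dictionary word lands on one of the K "good" hashes with
# probability K/N, which is the false-positive rate:
print(f"false-positive rate: 1 in {N / K:.0f}")            # 1 in ~4474

# Information-theoretic cost of an unordered K-element subset of N values:
print(f"optimal: {log2(comb(N, K)) / K:.2f} bits/word")     # ~13.57

# Where the gain over 27 bits comes from: the set is unordered, so you don't
# pay for the order of the K hashes; by Stirling, that saves roughly
# log2(K) - log2(e) ~ 13.4 bits per word, leaving:
print(f"27 - log2(K) + log2(e) = {27 - log2(K) + log2(e):.2f}")  # ~13.57
```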
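
And here's a toy version of the gap-encoding step, using a Rice code (the power-of-two special case of Golomb coding) with parameter 12, which I picked by eyeballing the ~4474 mean gap; random 27-bit values stand in for the real word hashes, and none of this is the article's actual code:

```python
import random

N_BITS = 27   # hash width from the article
K = 30_000    # dictionary size
RICE_K = 12   # Rice parameter 2^12, eyeballed from the ~4474 mean gap

def rice_encode(gap: int, k: int) -> str:
    """Rice code (power-of-two Golomb): quotient in unary, remainder in k bits."""
    q, r = gap >> k, gap & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

# Stand-in for "hash every dictionary word to 27 bits": K distinct random values.
random.seed(0)
hashes = sorted(random.sample(range(1 << N_BITS), K))

# Encode the differences between consecutive (sorted) hashes.
prev = 0
total_bits = 0
for h in hashes:
    total_bits += len(rice_encode(h - prev, RICE_K))
    prev = h

print(f"{total_bits / K:.2f} bits per word")   # ~13.7 bits/word
```

A general Golomb modulus near 0.69x the mean gap (~3100 here) instead of a power of two should close most of the remaining distance to 13.57.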

I've tried to think of how one could do better, even in principle and with infinite time, along the lines of “perfect hashing”: maybe there's a function that takes an alphabetic word, does some transformations on it, and produces a hash that is easy to check for membership in the good set. But thinking about it a bit more, the required false-positive rate (non-dictionary words shouldn't get mapped to anything in the "good" set) forces the hash to be at least ~27 bits wide. So what they did seems basically theoretically optimal? Or could there exist a way to map each word to a 27-bit integer such that the good strings are exactly those with values less than 30000, say? (Quick numbers on this below.)
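
A numeric aside on the optimality question (caveat: I'm going from memory on the lower bound, which I believe is ~log2(1/eps) bits per key for any structure with false-positive rate eps): the hash-the-words-and-store-the-fingerprints approach seems to give up only about log2(e) ~ 1.44 bits per word relative to that floor.

```python
from math import e, log2

K = 30_000
N = 2 ** 27
eps = K / N   # false-positive rate of the 27-bit fingerprint scheme

# Lower bound (as I recall it) for ANY structure with false-positive rate eps:
print(f"log2(1/eps)           = {log2(1 / eps):.2f} bits/word")            # ~12.13

# Cost of storing the 27-bit fingerprints exactly (the ~13.57 from above):
print(f"log2(1/eps) + log2(e) = {log2(1 / eps) + log2(e):.2f} bits/word")  # ~13.57
```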