Comment by syllogism

3 years ago

If you have very large dicts, you might find this hash table I wrote for spaCy helpful: https://github.com/explosion/preshed. You need to key the data with 64-bit keys. We use this wrapper around murmurhash for it: https://github.com/explosion/murmurhash

There are no docs, so obviously this might not be for you. But the software does work, and it is efficient. It's been executed many, many millions of times now.

My keys are strings, not 64-bit keys. But thanks, it's nice to share ideas.

  • The idea is to hash the string into a 64-bit key. You can store the string in a value, or you can have a separate vector and make the value a struct that has the key and the value.

    The chance of a collision in the 64-bit space is low if the hash distributes evenly, so you just yolo it.
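A minimal sketch of the idea above, in Python. It is not preshed's or murmurhash's API: `key64` and `StringTable` are hypothetical names, and BLAKE2b with an 8-byte digest from the standard library stands in for MurmurHash just to get a well-distributed 64-bit integer. The value here carries the original string alongside the payload, which is the "store the string in a value" option.

```python
import hashlib


def key64(s: str) -> int:
    """Hash a string to an unsigned 64-bit key.

    Assumption: BLAKE2b (8-byte digest) is used as a stand-in for MurmurHash.
    """
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class StringTable:
    """Hypothetical wrapper: a table keyed by 64-bit hashes of strings."""

    def __init__(self):
        # 64-bit key -> (original string, value)
        self._slots = {}

    def set(self, s, value):
        self._slots[key64(s)] = (s, value)

    def get(self, s, default=None):
        slot = self._slots.get(key64(s))
        if slot is None:
            return default
        stored, value = slot
        # Optional safety check: the stored string lets us notice a collision.
        # The "yolo" version would skip this and return the value directly.
        return value if stored == s else default


table = StringTable()
table.set("apple", 1)
table.set("banana", 2)
```

Keeping the string in the value costs memory but makes collisions detectable. Dropping it gives the smaller, faster table and relies entirely on the hash distributing evenly.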