Comment by susam
4 days ago
I just noticed the m = 10007 value in my comment above and I thought I should clarify it. The number of bits per bloom filter does not need to be a prime number if the hash functions have uniform distribution. Murmur2 hash functions do have uniform distribution, so m was not chosen to be prime in order to reduce collisions in the Bloom filter's bit positions. The reason for using a prime value was more mundane.
This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Ha, collision avoidance all the way down
Interesting trick about the constant value, and thank you for the detailed write up!
[flagged]