Comment by thomasmg
1 month ago
There are many variants. It really depends on what features you need. Cuckoo filters were mentioned. If you want to add and remove entries and the regular counting Bloom filter are not good enough, I would look at the "succinct counting blocked Bloom filter" [1]: they only need about twice the space than regular Bloom filters. Sure, cuckoo filters need less memory, but they can fail basically any time, while these can not.
Tiny filters: Some time ago I worked on tiny statistics [2]. This includes some 64-bit HyperLogLog implementations; some use linear counting, which is basically a 64-bit Bloom filter, until some limit, and only then switch to HyperLogLog. This is great for distinct counts of columns in databases (cardinality estimation). This project also includes 64-bit approximate counts and histograms.
[1] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [2] https://github.com/thomasmueller/tinyStats
FWIW, I found https://github.com/FastFilter/fastfilter_java/issues/28 a pretty good explanation of what's going on in the succinct counting blocked Bloom filters. (I'm not sure if the blocking is needed for the normal Bloom filter part, though, i.e., all bits don't necessarily need to fall within the same block, no? But the counts are stored separately for each block.)
Yes, non-blocked is also possible. This would need a bit less space, and would be a bit slower. The counts > 1 (per bit that is set) are stored spearately, yes.
> This would need a bit less space, and would be a bit slower.
I guess that is because the count storage update is really slow, right, so it's better to have one than two (or whatever number of set bits) operations? At least the linked code seems to process it one by one bit when updating, and without some sort of “rank of n-th set bit” operation (which would accelerate the “select” version fairly well), I'm not sure it could be made much faster than that either.
Edit: I see https://github.com/FastFilter/fastfilter_java/blob/master/fa... tries to make that operation in O(1), but to be honest, with this many operations, the cure almost looks worse than the disease. :-) Haven't benchmarked, though.
5 replies →