They praise bloom filters for being simple and describe how they work in two sentences. They criticize cuckoo filters for being too complicated. They praise their own algorithm for being simple, and then they don't describe it, instead linking to a repo of several hundred lines of code.
My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else. I'm not ashamed to admit that I have several such code blobs in production right now, but I won't pretend to call them simple.
The xor filter is more complex, but the concepts behind it are quite accessible. Let's give it a try.
The xor filter queries for a key by computing three hashes h1(k), h2(k), h3(k) of the key k and using those to index into three arrays a1, a2, a3 of K-bit values. The 3 values loaded from those arrays are xor'd together and compared to a fingerprint f(k).[0] If they are equal, the key is assumed to be contained in the filter.
Assuming that the fingerprint function is random, the probability of false positives is 2^-K.
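Here is a minimal Go sketch of that query path (my own toy code, not the authors' implementation; the splitmix-style mixer, the rotate trick for deriving the three indices, and the 8-bit fingerprint are illustrative stand-ins):
    package xorsketch
    
    // A toy xor filter with 8-bit fingerprints, split into three tables for
    // clarity (the paper stores them in a single array of length 3*blockLength).
    type xor8 struct {
        seed       uint64
        a1, a2, a3 []uint8
    }
    
    // mix is a splitmix64-style finalizer; any decent 64-bit hash works here.
    func mix(x uint64) uint64 {
        x ^= x >> 33
        x *= 0xff51afd7ed558ccd
        x ^= x >> 33
        x *= 0xc4ceb9fe1a85ec53
        x ^= x >> 33
        return x
    }
    
    // fingerprint is the K=8-bit value every stored key must match.
    func (f *xor8) fingerprint(h uint64) uint8 { return uint8(h ^ (h >> 32)) }
    
    // idx derives one of the three table indices by rotating the hash;
    // a cheap stand-in for three separately seeded hash functions.
    func (f *xor8) idx(h uint64, rot uint) int {
        r := (h >> rot) | (h << (64 - rot))
        return int(r % uint64(len(f.a1)))
    }
    
    // Contains reports whether key is (probably) in the set: load one entry
    // from each table, xor them, and compare against the key's fingerprint.
    func (f *xor8) Contains(key uint64) bool {
        h := mix(key ^ f.seed)
        v := f.a1[f.idx(h, 0)] ^ f.a2[f.idx(h, 21)] ^ f.a3[f.idx(h, 42)]
        return v == f.fingerprint(h)
    }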
Constructing the xor filter requires choosing the three hash functions at random and solving the system of linear equations given by:
    a1[h1(k_i)] + a2[h2(k_i)] + a3[h3(k_i)] = f(k_i)   for i = 1..N
(where + means xor, i.e. addition over GF(2)).
If the arrays are big enough (each one a fraction larger than N/3, where N is the number of elements), then with high probability the system has a solution. This comes down to the hyperedges {h1(k_i), h2(k_i), h3(k_i)} being "acyclic" -- the acyclicity would be easier to picture if you only had two hash functions, which give you normal undirected edges, but the proof of acyclicity only works with 3 or more hash functions. If the system has no solution, you just pick new hash functions at random and retry.
The complicated part of the algorithm is solving the system of linear equations efficiently. The complicated part of the correctness proof is showing that the system has a solution with high probability.
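And a rough sketch of the construction via the usual "peeling" trick, reusing the toy types above (my paraphrase of the standard technique, assuming distinct keys; the authors' code differs in details): repeatedly find a slot hit by exactly one remaining key, set that key aside on a stack, and once everything is peeled, assign the table entries in reverse order so each key's three entries xor to its fingerprint.
    // build tries to solve the linear system for the given keys by peeling.
    // It returns false if peeling gets stuck, in which case the caller retries
    // with a different seed (i.e., different hash functions).
    func (f *xor8) build(keys []uint64) bool {
        n := len(f.a1)
        total := 3 * n // global slot ids: 0..n-1 -> a1, n..2n-1 -> a2, rest -> a3
        count := make([]int, total)
        xorHash := make([]uint64, total) // xor of hashes of the keys hitting each slot
    
        slots := func(h uint64) [3]int {
            return [3]int{f.idx(h, 0), n + f.idx(h, 21), 2*n + f.idx(h, 42)}
        }
        for _, k := range keys {
            h := mix(k ^ f.seed)
            for _, s := range slots(h) {
                count[s]++
                xorHash[s] ^= h
            }
        }
    
        // Peel: a slot with exactly one remaining key lets us read that key's
        // hash straight out of xorHash and remove it from its other slots.
        var queue []int
        for s := 0; s < total; s++ {
            if count[s] == 1 {
                queue = append(queue, s)
            }
        }
        type peeled struct {
            h    uint64
            slot int
        }
        var stack []peeled
        for len(queue) > 0 {
            s := queue[len(queue)-1]
            queue = queue[:len(queue)-1]
            if count[s] != 1 {
                continue // may have dropped to 0 since it was queued
            }
            h := xorHash[s]
            stack = append(stack, peeled{h, s})
            for _, t := range slots(h) {
                count[t]--
                xorHash[t] ^= h
                if count[t] == 1 {
                    queue = append(queue, t)
                }
            }
        }
        if len(stack) != len(keys) {
            return false // the hyperedges had a cycle; pick new hash functions
        }
    
        entry := func(s int) *uint8 {
            switch {
            case s < n:
                return &f.a1[s]
            case s < 2*n:
                return &f.a2[s-n]
            default:
                return &f.a3[s-2*n]
            }
        }
        // Assign in reverse peeling order: when we fix up a key's peeled slot,
        // the two entries it reads are already final and nothing assigned later
        // will overwrite any of its three slots.
        for i := len(stack) - 1; i >= 0; i-- {
            p := stack[i]
            var other uint8
            for _, t := range slots(p.h) {
                if t != p.slot {
                    other ^= *entry(t)
                }
            }
            *entry(p.slot) = f.fingerprint(p.h) ^ other
        }
        return true
    }
    
    // newXor8 allocates ~1.23 slots per key (split over the three tables) and
    // retries the construction with fresh seeds until peeling succeeds.
    func newXor8(keys []uint64) *xor8 {
        n := int(1.23*float64(len(keys)))/3 + 1
        f := &xor8{a1: make([]uint8, n), a2: make([]uint8, n), a3: make([]uint8, n)}
        for seed := uint64(0); ; seed++ {
            f.seed = seed
            if f.build(keys) {
                return f
            }
        }
    }
If peeling stalls before every key is placed, the hyperedges contained a cycle and newXor8 simply retries with the next seed, which is the retry step described above.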
[0] The fingerprint is also a hash function. The difference is that f(k) can be fixed for the algorithm, whereas the h1, h2, h3 need to be chosen randomly when the xor filter is built.
Edit: Note that the paper combines the three arrays into a single large one, but that's an implementation detail. Though it makes you wonder whether one couldn't have three hash functions that each cover the entire range of the larger array and still make it work. That could potentially decrease the required size of the array marginally, at the expense of a more difficult proof.
He probably did not talk about it because this post was an announcement of his paper (https://arxiv.org/abs/1912.08258); the xor filter description starts on page 3.
There's also the linked paper, which has an algorithmic description (starting on p. 3).
I agree, I kept reading to see their approach and "woof, there it is, so much better, you work it out!". sigh.
Rule #1 of getting something adopted breached.
>My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else.
Or, one could just read the code instead of guessing - it's not that hard to follow: similar to the Bloom idea, plus XORing the hashes.
> not that hard to follow
I'd say that the requirement to solve a system of linear equations that may or may not even have a solution in order to construct the filter might qualify as hard to follow.
Very well said.
Neither the blog nor the repo contains benchmarks, which is a bit weird when the title of a technical article has the word "Faster" in it.
Then, after digging through the actual paper, it turns out that the construction performance of the xor filter is actually massively slower than Bloom's. That may not matter for many uses, but it's critical for others (like mine).
> Blocked Bloom = 10 ns/key
> Bloom 8 = 40 ns/key
> Xor 8 = 110 ns/key
Co-author of the paper here. The repo https://github.com/FastFilter/fastfilter_cpp contains all the source code and benchmarks. Yes, construction of the xor filter is slower, but I wouldn't say massively. You should probably compare Bloom 12 against Xor 8, as those have similar false positive rates. Bloom 12 is 60 ns/key, versus Xor 8 at 110 ns/key. The cuckoo filter is somewhere in between. Those numbers don't include calculating the hash code of a key, which might be another 30 ns/key or so.
What is your use case? Maybe a blocked Bloom filter would be better (if you have the memory).
Thanks, found the benchmarks (your blog links primarily to a different repo, "/xorfilter").
My use case is a new kind of debugger (https://BugJail.com), which works by capturing 20+ million events per second from the running application and then reconstructing a model of program execution from the raw events. With 20+ million events per second, that 110 ns per key is just not going to work.
Yes, but the grandparent is correct. The first word in the title is still “faster”.
The paper claims that it is more space efficient than Golomb coded sequences, but this is clearly not the case, at least when the Golomb parameters and FP rate are set optimally[1]: a GCS can be well within a fraction of 1% of the information-theoretic limit.
The main limitation of GCS in terms of communication efficiency is just that there aren't that many optimal rate options. Though arithmetic coding the differences isn't really that much more complex to implement, and it's optimal for whatever parameters you want.
Now, if the paper wanted to talk about _access_ efficiency, then sure, GCS don't result in a data structure that's efficient for random access.
[1] https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc84...
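For readers who haven't met GCS: it is just the sorted list of key hashes mapped into a range of roughly N/fpp, stored as Golomb-Rice-coded gaps. A toy Go encoder, using the naive parameter choice rather than the tuned one being debated here (all names are mine):
    package gcssketch
    
    import "sort"
    
    // bitWriter is a deliberately simple bit sink (a []bool, not packed words).
    type bitWriter struct{ out []bool }
    
    func (w *bitWriter) bit(b bool) { w.out = append(w.out, b) }
    
    // rice writes v with Rice parameter k: quotient v>>k in unary, then k remainder bits.
    func (w *bitWriter) rice(v uint64, k uint) {
        for q := v >> k; q > 0; q-- {
            w.bit(true)
        }
        w.bit(false)
        for i := int(k) - 1; i >= 0; i-- {
            w.bit(v&(1<<uint(i)) != 0)
        }
    }
    
    // encode maps each key into [0, n*2^fppBits) with the caller's hash, sorts
    // the values, and Rice-codes the gaps between consecutive values. The
    // false positive rate is roughly 2^-fppBits.
    func encode(keys []uint64, hash func(uint64) uint64, fppBits uint) []bool {
        n := uint64(len(keys))
        m := n << fppBits
        vals := make([]uint64, 0, n)
        for _, k := range keys {
            vals = append(vals, hash(k)%m)
        }
        sort.Slice(vals, func(i, j int) bool { return vals[i] < vals[j] })
    
        w := &bitWriter{}
        prev := uint64(0)
        for _, v := range vals {
            w.rice(v-prev, fppBits) // gaps average ~2^fppBits, so k = fppBits
            prev = v
        }
        return w.out
    }
A membership query decodes gaps until it reaches or passes the target value, which is exactly the random-access weakness noted above.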
(Co-author here) For GCS, I also found the Golomb-Rice parameter Q-1 to be better; this is what we used. According to my calculation, the overhead of GCS is around 1.5 bit/key. See also https://github.com/0xcb/Golomb-coded-map . Sure, with arithmetic coding, or ANS coding, it should be lower; it would be interesting to see how much. I kind of like GCS.
The overhead you're seeing is because you've mismatched your FP rate with the optimal size for your GCS code. If they're exactly matched the overhead is very low (not as low as an arithmetic code simply because each entry is still coded into bits, so you get something like a half-bit overhead on average from that).
If I were coding a static approximate membership query structure, I wouldn't use either bloom filters or GCS, I'd use an Elias-Fano-coded sequence of fingerprints. That has nearly the same compression efficiency as Golomb coding but can be efficiently queried with no decompression.
Interesting idea! Elias-Fano monotone sequences unfortunately do need a bit more space than just a list of Rice codes. What is possible, without having to increase the space usage, is to re-arrange the Rice codes in each bucket, so that all variable parts come first, and the fixed parts at the end. That way, lookup speed would be much, much faster for large buckets. We used this for RecSplit (https://arxiv.org/abs/1910.06416). I will try that out for GCS as well!
Direct link to paper (pdf) - https://arxiv.org/pdf/1912.08258.pdf
I haven't fully read the paper yet, but it seems like a variation of a bloom filter where membership is determined by xoring three hashes (see algorithm 1 in the pdf)
If that's the case then both membership and non-membership will be probabilistic, rather than one or the other being certain like bloom filters, right?
"Membership test: returns true if the key x is likely in S, false otherwise"
I think they circumvent the problem with a more complicated insertion procedure where all of the keys need to be inserted at once; it seems to use a stack to avoid hash collisions.
Take a look at https://github.com/FastFilter/xorfilter/blob/master/xorfilte...
I can usually read go, but I've never been that great at deciphering "academic code" so the exact details escape me here.
One of the advantages of bloom filters is that you can also do intersections, unions and cardinality estimations thereof on remote sets. These don't seem to support that.
Yes it is possible to do this, but do people actually use those features? For (pure) cardinality estimation, there is HyperLogLog, which can also be merged...
> do people actually use those features?
I did, not sure how common it is though. For much-larger-than-memory problems it can give a tremendous speedup to be able to do an estimate or logical operation(s) before having to hit the disk.
As far as I can tell it doesn’t support insertion operations. That’s a massive drawback vs both cuckoo and bloom filters.
It's somewhat disingenuous that they are clearly attempting to position this as superior to the other data structures while theirs cannot do one of the core operations.
I think there are many use cases where you don't add entries after construction, for example for log-structured merge trees. You can't add too many anyway, otherwise it will hurt the false positive rate. What _are_ the important use cases where you add entries afterwards? BTW for cuckoo filters, add and remove can sometimes fail (depending on parameters and luck).
There are real-world applications where this* would drop in. I believe Chrome's malicious-site blocker employs a Bloom filter blob.
* assuming it does what it says, I haven't looked.
Assuming you're talking about Google Safe Browsing it needs the ability to update the data structures. Implementations vary but I wouldn't use Bloom Filters and this seems even less appropriate.
I'm sure there's some application where this is the right choice, but the reaction here shows that people don't like being told the propaganda approach to new data structures. Just returning true is also technically "faster and smaller than Bloom Filters" as the headline for this item says when I write this. Almost useless, but faster and smaller.
It is absolutely outrageous that you are accusing the authors of this work of dishonesty because of limitations they explicitly called out in their introductory post as well as the paper.
The compared cuckoo filters have a similar property, no? (Whether you can update them or not depends on luck.)
When the table is less full than some cliff-effect threshold, the probability of failure is arbitrarily negligible.
If having a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes and so scanning it on every query will cost nothing but a single well predicted branch.
If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.
[Of course, all assuming someone can't attack your hash function.]
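Concretely, that augmentation is just a thin wrapper; a sketch in Go, with a hypothetical cuckooFilter interface standing in for whatever implementation you already use:
    package cuckoosketch
    
    // cuckooFilter is whatever cuckoo filter implementation you already have;
    // Insert reports failure instead of silently degrading.
    type cuckooFilter interface {
        Insert(key uint64) bool
        Lookup(key uint64) bool
    }
    
    // withFallback keeps the (almost certainly zero) keys whose insertion
    // failed in a plain slice that is scanned on every lookup.
    type withFallback struct {
        filter   cuckooFilter
        overflow []uint64
    }
    
    func (c *withFallback) Insert(key uint64) {
        if !c.filter.Insert(key) {
            c.overflow = append(c.overflow, key)
        }
    }
    
    func (c *withFallback) Lookup(key uint64) bool {
        if c.filter.Lookup(key) {
            return true
        }
        for _, k := range c.overflow { // empty in almost all universes
            if k == key {
                return true
            }
        }
        return false
    }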
It would be more fair to compare with minimal perfect hashing, as it does not support insertion either.
Also of academic interest, the Neural Bloom Filter: https://arxiv.org/abs/1906.04304
It achieves at best a 36% space saving compared to Bloom filters, but needs a floating-point accelerator to reach the same throughput. It compares against perfect hashing, which also does not support insertions.
Comparing a static data structure to two well-known dynamic structures is disingenuous, to say the least. And for a static structure you can do better than just a minimal perfect hash table of fingerprints. You can use a succinct data structure to compress the fingerprints in a form that is still efficiently queryable (unlike say Golomb-compressed sequences): https://www.antoniomallia.it/sorted-integers-compression-wit...
Elias-Fano is overestimated here, without any proof or comparison benchmarks.
Like other integer compression methods, you still need offsets to small blocks for efficient random access.
Getting a single value in a block is done with slow, branchy bit ops like ctz and bit shifting. That's decoding.
The fastest Elias-Fano implementation is in https://github.com/powturbo/TurboPFor and partially uses SIMD.
VLDB also just published a paper on a similar, improved bloom-like data structure "Vacuum Filters: More Space-Efficient and Faster Replacement for Bloom and Cuckoo Filters" http://www.vldb.org/pvldb/vol13/p197-wang.pdf
It would be interesting to get a comparison between Xor and Vacuum.
Thanks for the link! I wasn't aware of this. The source code is available at https://github.com/wuwuz/Vacuum-Filter/
To me it looks like a version of the cuckoo filter that isn't restricted to a 2^n size. We made the same change in the benchmark of the xor filter against the cuckoo filter (and others made the same change). There are some more changes, and the paper claims the vacuum filter is faster than the cuckoo filter. I will try to add the vacuum filter to our benchmark suite at https://github.com/FastFilter/fastfilter_cpp and see what results I get.
Interesting. Where does 3 come from for the number of hash functions? Why not 4 or 5?
The minimum number that is argued to work is also the cheapest. 2 was simply deemed insecure.
No, actually 2 would also work. It's just that 3 hash functions need the least space for some reason (less than 2, and less than 4 or more).
Insecure? What does that mean in the context of a read-only data structure?
How is this different from a hashset where you simply never do the final key comparison? As far as I can tell both require linear storage space and have around the same storage complexity.
It seems like the main "innovation" here is simply using a perfect hash function for your hashset.
From the paper: "We store the fingerprints in an array B with capacity c slightly larger than the cardinality of the set |S | (i.e., c ≈ 1.23 × |S |)."
The constant factor is different. If you want a small number of collisions, you will need a very small load-factor (you can't do most techniques for handling collisions because you aren't storing the key). So if you treat your hash-table as an array of bits and you run at a load-factor of say .1, then you are now using 10 bits per key, which is a lot more space than a bloom filter or an xor filter with the same false-positive rate.
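Rough numbers for a 10% false positive rate, ignoring rounding to whole fingerprint bits (my arithmetic, not from the paper):
    bit-array "hash set" at load factor 0.1:  ~10 bits/key
    Bloom filter:  1.44 * log2(1/0.1) ≈ 4.8 bits/key
    xor filter:    1.23 * log2(1/0.1) ≈ 4.1 bits/key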
Bloom filters don't require linear space. You can essentially create a set of arbitrarily large objects, as long as you have a hash for them. It requires O(C) memory where C is the size of your largest hash function. As a trade-off, it's not deterministic, since there can be hash collisions due to the pigeonhole principle. So, even though the bloom filter claims the object is in the set, there is a certain probability (that you can calculate) that it's not actually there. On the other hand, a tree-set or hash-set will be deterministically correct, but they require O(N) space and the tree-set has O(log N) lookup cost.
Bloom filters absolutely call for linear space if you are using them appropriately: given the hash collision probability you are working with, the size of the required filter is linear in the number of elements you want to store. Yes, it is "technically true" (the most useless kind of true) that you can store as much stuff as you want in the filter... but you get increasingly worse false positive rates that eventually just saturate to near 100%. If you actually read any of the math--including this linked work on Xor filters--you will always see them described in terms of "bits per item stored" (where the number of bits is extremely small vs. what you would expect). Hell: Wikipedia even describes the algorithm's storage requirements as "A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.".
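For reference, that 9.6 figure falls out of the standard formula for an optimally configured Bloom filter:
    bits/key = log2(1/fpp) / ln(2) ≈ 1.44 * log2(1/fpp)
    fpp = 1%:  1.44 * log2(100) ≈ 1.44 * 6.64 ≈ 9.6 bits/key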
How big of a problem can false positives get?
In terms of the URL blacklist, is the 1% of false positives 1% of all possible URLs? Or of the amount of requests to the filter? Or what?
It's neither: the 1% applies to lookups for URLs that are not actually on the blacklist -- about 1% of those queries will still (falsely) come back positive. The filter doesn't know anything about the set of all possible URLs.
The C++ code, anyway, is remarkably prolix, probably 4-10x longer than it needs to be. It is probably best that they did not attempt Rust, under the circumstances.
From the blog post: > It would be relatively easy to start from the Go version and produce a Rust version, but I had to stop somewhere.
(I bet someone in hacker news is saying "challenge accepted!" and we'll have a xorfilter crate very soon)
Yes, I read that.
The point was that as bad as their C++ code is, their Rust version might be even more disappointing.
Downvoting me just for mentioning Rust only demonstrates immaturity.