They praise bloom filters for being simple and describe how they work in two sentences. They criticize cuckoo filters for being too complicated. They praise their own algorithm for being simple, and then they don't describe it, instead linking to a repo of several hundred lines of code.
My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else. I'm not ashamed to admit that I have several such code blobs in production right now, but I won't pretend to call them simple.
The xor filter is more complex, but the concepts behind it are quite accessible. Let's give it a try.
The xor filter queries for a key by computing three hashes h1(k), h2(k), h3(k) of the key k and using those to index into three arrays a1, a2, a3 of K-bit values. The 3 values loaded from those arrays are xor'd together and compared to a fingerprint f(k).[0] If they are equal, the key is assumed to be contained in the filter.
Assuming that the fingerprint function is random, the probability of false positives is 2^-K.
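Here is a minimal Go sketch of that query path (my own toy code, not the authors' implementation; the splitmix-style mixer, the rotate trick for deriving the three indices, and the 8-bit fingerprint are illustrative stand-ins):
    package xorsketch
    
    // A toy xor filter with 8-bit fingerprints, split into three tables for
    // clarity (the paper stores them in a single array of length 3*blockLength).
    type xor8 struct {
        seed       uint64
        a1, a2, a3 []uint8
    }
    
    // mix is a splitmix64-style finalizer; any decent 64-bit hash works here.
    func mix(x uint64) uint64 {
        x ^= x >> 33
        x *= 0xff51afd7ed558ccd
        x ^= x >> 33
        x *= 0xc4ceb9fe1a85ec53
        x ^= x >> 33
        return x
    }
    
    // fingerprint is the K=8-bit value every stored key must match.
    func (f *xor8) fingerprint(h uint64) uint8 { return uint8(h ^ (h >> 32)) }
    
    // idx derives one of the three table indices by rotating the hash;
    // a cheap stand-in for three separately seeded hash functions.
    func (f *xor8) idx(h uint64, rot uint) int {
        r := (h >> rot) | (h << (64 - rot))
        return int(r % uint64(len(f.a1)))
    }
    
    // Contains reports whether key is (probably) in the set: load one entry
    // from each table, xor them, and compare against the key's fingerprint.
    func (f *xor8) Contains(key uint64) bool {
        h := mix(key ^ f.seed)
        v := f.a1[f.idx(h, 0)] ^ f.a2[f.idx(h, 21)] ^ f.a3[f.idx(h, 42)]
        return v == f.fingerprint(h)
    }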
Constructing the xor filter requires choosing the three hash functions at random and solving the system of linear equations given by:
    a1[h1(k_i)] + a2[h2(k_i)] + a3[h3(k_i)] = f(k_i)   for i = 1..N
(where + means xor, i.e. addition over GF(2)).
If the arrays are big enough (each one a fraction larger than N/3, where N is the number of elements), then with high probability the system has a solution. This comes down to the hyperedges {h1(k_i), h2(k_i), h3(k_i)} being "acyclic" -- the acyclicity would be easier to picture if you only had two hash functions, which give you normal undirected edges, but the proof of acyclicity only works with 3 or more hash functions. If the system has no solution, you just pick new hash functions at random and retry.
The complicated part of the algorithm is solving the system of linear equations efficiently. The complicated part of the correctness proof is showing that the system has a solution with high probability.
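And a rough sketch of the construction via the usual "peeling" trick, reusing the toy types above (my paraphrase of the standard technique, assuming distinct keys; the authors' code differs in details): repeatedly find a slot hit by exactly one remaining key, set that key aside on a stack, and once everything is peeled, assign the table entries in reverse order so each key's three entries xor to its fingerprint.
    // build tries to solve the linear system for the given keys by peeling.
    // It returns false if peeling gets stuck, in which case the caller retries
    // with a different seed (i.e., different hash functions).
    func (f *xor8) build(keys []uint64) bool {
        n := len(f.a1)
        total := 3 * n // global slot ids: 0..n-1 -> a1, n..2n-1 -> a2, rest -> a3
        count := make([]int, total)
        xorHash := make([]uint64, total) // xor of hashes of the keys hitting each slot
    
        slots := func(h uint64) [3]int {
            return [3]int{f.idx(h, 0), n + f.idx(h, 21), 2*n + f.idx(h, 42)}
        }
        for _, k := range keys {
            h := mix(k ^ f.seed)
            for _, s := range slots(h) {
                count[s]++
                xorHash[s] ^= h
            }
        }
    
        // Peel: a slot with exactly one remaining key lets us read that key's
        // hash straight out of xorHash and remove it from its other slots.
        var queue []int
        for s := 0; s < total; s++ {
            if count[s] == 1 {
                queue = append(queue, s)
            }
        }
        type peeled struct {
            h    uint64
            slot int
        }
        var stack []peeled
        for len(queue) > 0 {
            s := queue[len(queue)-1]
            queue = queue[:len(queue)-1]
            if count[s] != 1 {
                continue // may have dropped to 0 since it was queued
            }
            h := xorHash[s]
            stack = append(stack, peeled{h, s})
            for _, t := range slots(h) {
                count[t]--
                xorHash[t] ^= h
                if count[t] == 1 {
                    queue = append(queue, t)
                }
            }
        }
        if len(stack) != len(keys) {
            return false // the hyperedges had a cycle; pick new hash functions
        }
    
        entry := func(s int) *uint8 {
            switch {
            case s < n:
                return &f.a1[s]
            case s < 2*n:
                return &f.a2[s-n]
            default:
                return &f.a3[s-2*n]
            }
        }
        // Assign in reverse peeling order: when we fix up a key's peeled slot,
        // the two entries it reads are already final and nothing assigned later
        // will overwrite any of its three slots.
        for i := len(stack) - 1; i >= 0; i-- {
            p := stack[i]
            var other uint8
            for _, t := range slots(p.h) {
                if t != p.slot {
                    other ^= *entry(t)
                }
            }
            *entry(p.slot) = f.fingerprint(p.h) ^ other
        }
        return true
    }
    
    // newXor8 allocates ~1.23 slots per key (split over the three tables) and
    // retries the construction with fresh seeds until peeling succeeds.
    func newXor8(keys []uint64) *xor8 {
        n := int(1.23*float64(len(keys)))/3 + 1
        f := &xor8{a1: make([]uint8, n), a2: make([]uint8, n), a3: make([]uint8, n)}
        for seed := uint64(0); ; seed++ {
            f.seed = seed
            if f.build(keys) {
                return f
            }
        }
    }
If peeling stalls before every key is placed, the hyperedges contained a cycle and newXor8 simply retries with the next seed, which is the retry step described above.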
[0] The fingerprint is also a hash function. The difference is that f(k) can be fixed for the algorithm, whereas the h1, h2, h3 need to be chosen randomly when the xor filter is built.
Edit: Note that the paper combines the three arrays into a single large one, but that's an implementation detail. Though it makes you wonder whether one couldn't have three hash functions that each cover the entire range of the larger array and still make it work. That could potentially decrease the required size of the array marginally, at the expense of a more difficult proof.
He probably did not talk about it because this post was an announcement of his paper (https://arxiv.org/abs/1912.08258); the xor filter description starts on page 3.
There's also the linked paper, which has an algorithmic description (starting on p. 3).
I agree, I kept reading to see their approach and "woof, there it is, so much better, you work it out!". sigh.
Rule #1 of getting something adopted breached.
>My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else.
Or, one could just read the code instead of guessing - it's not that hard to follow: similar to the Bloom idea, plus XORing the hashes.
> not that hard to follow
I'd say that the requirement to solve a system of linear equations that may or may not even have a solution in order to construct the filter might qualify as hard to follow.
Very well said.
Neither the blog nor the repo contains benchmarks, which is a bit weird when the title of a technical article has the word "Faster" in it.
Then, after digging through the actual paper, it turns out that the construction performance of the xor filter is actually massively slower than Bloom's. That may not matter for many uses, but it's critical for others (like mine).
> Blocked Bloom = 10 ns/key
> Bloom 8 = 40 ns/key
> Xor 8 = 110 ns/key
Co-author of the paper here. The repo https://github.com/FastFilter/fastfilter_cpp contains all the source code and benchmarks. Yes, construction of the xor filter is slower, but I wouldn't say massively. You should probably compare Bloom 12 against Xor 8, as those have similar false positive rates. Bloom 12 is 60 ns/key, versus Xor 8 at 110 ns/key. The cuckoo filter is somewhere in between. Those numbers don't include calculating the hash code of a key, which might be another 30 ns/key or so.
What is your use case? Maybe a blocked Bloom filter would be better (if you have the memory).
Thanks, found the benchmarks (your blog links primarily to a different repo, "/xorfilter").
My use case is a new kind of debugger (https://BugJail.com), which works by capturing 20+ million events per second from the running application and then reconstructing a model of program execution from the raw events. With 20+ million events per second, that 110 ns per key is just not going to work.
Yes, but the grandparent is correct. The first word in the title is still “faster”.
The paper claims that it is more space efficient than Golomb coded sequences, but this is clearly not the case, at least when the Golomb parameters and FP rate are set optimally[1]: a GCS can be well within a fraction of 1% of the information-theoretic limit.
The main limitation of GCS in terms of communication efficiency is just that there aren't that many optimal rate options. Though arithmetic coding the differences isn't really that much more complex to implement, and it's optimal for whatever parameters you want.
Now, if the paper wanted to talk about _access_ efficiency, then sure, GCS don't result in a data structure that's efficient for random access.
[1] https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc84...
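For readers who haven't met GCS: it is just the sorted list of key hashes mapped into a range of roughly N/fpp, stored as Golomb-Rice-coded gaps. A toy Go encoder, using the naive parameter choice rather than the tuned one being debated here (all names are mine):
    package gcssketch
    
    import "sort"
    
    // bitWriter is a deliberately simple bit sink (a []bool, not packed words).
    type bitWriter struct{ out []bool }
    
    func (w *bitWriter) bit(b bool) { w.out = append(w.out, b) }
    
    // rice writes v with Rice parameter k: quotient v>>k in unary, then k remainder bits.
    func (w *bitWriter) rice(v uint64, k uint) {
        for q := v >> k; q > 0; q-- {
            w.bit(true)
        }
        w.bit(false)
        for i := int(k) - 1; i >= 0; i-- {
            w.bit(v&(1<<uint(i)) != 0)
        }
    }
    
    // encode maps each key into [0, n*2^fppBits) with the caller's hash, sorts
    // the values, and Rice-codes the gaps between consecutive values. The
    // false positive rate is roughly 2^-fppBits.
    func encode(keys []uint64, hash func(uint64) uint64, fppBits uint) []bool {
        n := uint64(len(keys))
        m := n << fppBits
        vals := make([]uint64, 0, n)
        for _, k := range keys {
            vals = append(vals, hash(k)%m)
        }
        sort.Slice(vals, func(i, j int) bool { return vals[i] < vals[j] })
    
        w := &bitWriter{}
        prev := uint64(0)
        for _, v := range vals {
            w.rice(v-prev, fppBits) // gaps average ~2^fppBits, so k = fppBits
            prev = v
        }
        return w.out
    }
A membership query decodes gaps until it reaches or passes the target value, which is exactly the random-access weakness noted above.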
(Co-author here) For GCS, I also found the Golomb-Rice parameter Q-1 to be better; this is what we used. According to my calculation, the overhead of GCS is around 1.5 bit/key. See also https://github.com/0xcb/Golomb-coded-map . Sure, with arithmetic coding, or ANS coding, it should be lower; it would be interesting to see how much. I kind of like GCS.
The overhead you're seeing is because you've mismatched your FP rate with the optimal size for your GCS code. If they're exactly matched the overhead is very low (not as low as an arithmetic code simply because each entry is still coded into bits, so you get something like a half-bit overhead on average from that).
If I were coding a static approximate membership query structure, I wouldn't use either bloom filters or GCS, I'd use an Elias-Fano-coded sequence of fingerprints. That has nearly the same compression efficiency as Golomb coding but can be efficiently queried with no decompression.
Interesting idea! Elias-Fano monotone sequences unfortunately do need a bit more space than just a list of Rice codes. What is possible, without having to increase the space usage, is to re-arrange the Rice codes in each bucket, so that all variable parts come first, and the fixed parts at the end. That way, lookup speed would be much, much faster for large buckets. We used this for RecSplit (https://arxiv.org/abs/1910.06416). I will try that out for GCS as well!
Direct link to paper (pdf) - https://arxiv.org/pdf/1912.08258.pdf
I haven't fully read the paper yet, but it seems like a variation of a bloom filter where membership is determined by xoring three hashes (see algorithm 1 in the pdf)
If that's the case then both membership and non-membership will be probabilistic, rather than one or the other being certain like bloom filters, right?
"Membership test: returns true if the key x is likely in S, false otherwise"
I think they circumvent the problem with a more complicated insertion procedure where all of the keys need to be inserted at once; it seems to use a stack to avoid hash collisions.
Take a look at https://github.com/FastFilter/xorfilter/blob/master/xorfilte...
I can usually read go, but I've never been that great at deciphering "academic code" so the exact details escape me here.
One of the advantages of bloom filters is that you can also do intersections, unions and cardinality estimations thereof on remote sets. These don't seem to support that.
Yes it is possible to do this, but do people actually use those features? For (pure) cardinality estimation, there is HyperLogLog, which can also be merged...
> do people actually use those features?
I did, not sure how common it is though. For much-larger-than-memory problems it can give a tremendous speedup to be able to do an estimate or logical operation(s) before having to hit the disk.
As far as I can tell it doesn’t support insertion operations. That’s a massive drawback vs both cuckoo and bloom filters.
It's somewhat disingenuous that they are clearly attempting to position this as superior to the other data structures while theirs cannot do one of the core operations.
I think there are many use cases where you don't add entries after construction, for example for log-structured merge trees. You can't add too many anyway, otherwise it will hurt the false positive rate. What _are_ the important use cases where you add entries afterwards? BTW for cuckoo filters, add and remove can sometimes fail (depending on parameters and luck).
There are real-world applications where this* would drop in. I believe Chrome's malicious-site blocker employs a Bloom filter blob.
* assuming it does what it says, I haven't looked.
Assuming you're talking about Google Safe Browsing it needs the ability to update the data structures. Implementations vary but I wouldn't use Bloom Filters and this seems even less appropriate.
I'm sure there's some application where this is the right choice, but the reaction here shows that people don't like being told the propaganda approach to new data structures. Just returning true is also technically "faster and smaller than Bloom Filters" as the headline for this item says when I write this. Almost useless, but faster and smaller.
It is absolutely outrageous that you are accusing the authors of this work of dishonesty because of limitations they explicitly called out in their introductory post as well as the paper.
The compared cuckoo filters have a similar property, no? (Whether you can update them or not depends on luck.)
When the table is less full than some cliff-effect threshold, the probability of failure is arbitrarily negligible.
If having a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes and so scanning it on every query will cost nothing but a single well predicted branch.
If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.
[Of course, all assuming someone can't attack your hash function.]
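Concretely, that augmentation is just a thin wrapper; a sketch in Go, with a hypothetical cuckooFilter interface standing in for whatever implementation you already use:
    package cuckoosketch
    
    // cuckooFilter is whatever cuckoo filter implementation you already have;
    // Insert reports failure instead of silently degrading.
    type cuckooFilter interface {
        Insert(key uint64) bool
        Lookup(key uint64) bool
    }
    
    // withFallback keeps the (almost certainly zero) keys whose insertion
    // failed in a plain slice that is scanned on every lookup.
    type withFallback struct {
        filter   cuckooFilter
        overflow []uint64
    }
    
    func (c *withFallback) Insert(key uint64) {
        if !c.filter.Insert(key) {
            c.overflow = append(c.overflow, key)
        }
    }
    
    func (c *withFallback) Lookup(key uint64) bool {
        if c.filter.Lookup(key) {
            return true
        }
        for _, k := range c.overflow { // empty in almost all universes
            if k == key {
                return true
            }
        }
        return false
    }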
It would be more fair to compare with minimal perfect hashing, as it does not support insertion either.
Also of academic interest, the Neural Bloom Filter: https://arxiv.org/abs/1906.04304
It achieves at best a 36% space saving compared to Bloom filters, but needs a floating-point accelerator to reach the same throughput. It compares against perfect hashing, which also does not support insertions.
Comparing a static data structure to two well-known dynamic structures is disingenuous, to say the least. And for a static structure you can do better than just a minimal perfect hash table of fingerprints. You can use a succinct data structure to compress the fingerprints in a form that is still efficiently queryable (unlike say Golomb-compressed sequences): https://www.antoniomallia.it/sorted-integers-compression-wit...
Elias-Fano is overestimated here, without any proof or comparison benchmarks.
Like other integer compression methods, you still need offsets to small blocks for efficient random access.
Getting a single value in a block is done with slow, branchy bit ops like ctz and bit shifting. That's decoding.
The fastest Elias-Fano implementation is in https://github.com/powturbo/TurboPFor and partially uses SIMD.
VLDB also just published a paper on a similar, improved bloom-like data structure "Vacuum Filters: More Space-Efficient and Faster Replacement for Bloom and Cuckoo Filters" http://www.vldb.org/pvldb/vol13/p197-wang.pdf
It would be interesting to get a comparison between Xor and Vacuum.
Thanks for the link! I wasn't aware of this. The source code is available at https://github.com/wuwuz/Vacuum-Filter/
To me it looks like a version of the cuckoo filter that isn't restricted to a 2^n size. We made the same change in the benchmark of the xor filter against the cuckoo filter (and others made the same change). There are some more changes, and the paper claims the vacuum filter is faster than the cuckoo filter. I will try to add the vacuum filter to our benchmark suite at https://github.com/FastFilter/fastfilter_cpp and see what results I get.
Interesting. Where does 3 come from for the number of hash functions? Why not 4 or 5?
The minimum number that is argued to work is also the cheapest. 2 was simply deemed insecure.
No, actually 2 would also work. It's just that 3 hash functions need the least space for some reason (less than 2, and less than 4 or more).
Insecure? What does that mean in the context of a read-only data structure?
How is this different from a hashset where you simply never do the final key comparison? As far as I can tell both require linear storage space and have around the same storage complexity.
It seems like the main "innovation" here is simply using a perfect hash function for your hashset.
From the paper: "We store the fingerprints in an array B with capacity c slightly larger than the cardinality of the set |S | (i.e., c ≈ 1.23 × |S |)."
The constant factor is different. If you want a small number of collisions, you will need a very small load-factor (you can't do most techniques for handling collisions because you aren't storing the key). So if you treat your hash-table as an array of bits and you run at a load-factor of say .1, then you are now using 10 bits per key, which is a lot more space than a bloom filter or an xor filter with the same false-positive rate.
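Rough numbers for a 10% false positive rate, ignoring rounding to whole fingerprint bits (my arithmetic, not from the paper):
    bit-array "hash set" at load factor 0.1:  ~10 bits/key
    Bloom filter:  1.44 * log2(1/0.1) ≈ 4.8 bits/key
    xor filter:    1.23 * log2(1/0.1) ≈ 4.1 bits/key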
Bloom filters don't require linear space. You can essentially create a set of arbitrarily large objects, as long as you have a hash for them. It requires O(C) memory where C is the size of your largest hash function. As a trade-off, it's not deterministic, since there can be hash collisions due to the pigeonhole principle. So, even though the bloom filter claims the object is in the set, there is a certain probability (that you can calculate) that it's not actually there. On the other hand, a tree-set or hash-set will be deterministically correct, but they require O(N) space and the tree-set has O(log N) lookup cost.
Bloom filters absolutely call for linear space if you are using them appropriately: given the hash collision probability you are working with, the size of the required filter is linear in the number of elements you want to store. Yes, it is "technically true" (the most useless kind of true) that you can store as much stuff as you want in the filter... but you get increasingly worse false positive rates that eventually just saturate to near 100%. If you actually read any of the math--including this linked work on Xor filters--you will always see them described in terms of "bits per item stored" (where the number of bits is extremely small vs. what you would expect). Hell: Wikipedia even describes the algorithm's storage requirements as "A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.".
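For reference, that 9.6 figure falls out of the standard formula for an optimally configured Bloom filter:
    bits/key = log2(1/fpp) / ln(2) ≈ 1.44 * log2(1/fpp)
    fpp = 1%:  1.44 * log2(100) ≈ 1.44 * 6.64 ≈ 9.6 bits/key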
How big of a problem can false positives get?
In terms of the URL blacklist, is the 1% of false positives 1% of all possible URLs? Or of the amount of requests to the filter? Or what?
It's neither: the 1% applies to lookups for URLs that are not actually on the blacklist -- about 1% of those queries will still (falsely) come back positive. The filter doesn't know anything about the set of all possible URLs.
The C++ code, anyway, is remarkably prolix, probably 4-10x longer than it needs to be. It is probably best that they did not attempt Rust, under the circumstances.
From the blog post: > It would be relatively easy to start from the Go version and produce a Rust version, but I had to stop somewhere.
(I bet someone in hacker news is saying "challenge accepted!" and we'll have a xorfilter crate very soon)
Yes, I read that.
The point was that as bad as their C++ code is, their Rust version might be even more disappointing.
Downvoting me just for mentioning Rust only demonstrates immaturity.