Comment by olliej
6 years ago
As far as I can tell it doesn’t support insertion operations. That’s a massive drawback vs both cuckoo and bloom filters.
It’s somewhat disingenuous that they are clearly attempting to position this as superior to the other data structures while theirs cannot do one of the core operations.
I think there are many use cases where you don't add entries after construction, for example log-structured merge trees. You can't add too many entries anyway, otherwise it will hurt the false positive rate. What _are_ the important use cases where you add entries afterwards? BTW, for cuckoo filters, add and remove can sometimes fail (depending on parameters and luck).
There are real-world applications where this* would drop in. I believe Chrome's malicious-site blocker employs a Bloom filter blob.
* assuming it does what it says, I haven't looked.
Assuming you're talking about Google Safe Browsing, it needs the ability to update the data structures. Implementations vary, but I wouldn't use Bloom filters, and this seems even less appropriate.
I'm sure there's some application where this is the right choice, but the reaction here shows that people don't like the propaganda approach to promoting new data structures. Just returning true is also technically "faster and smaller than Bloom filters", as the headline for this item says as I write this. Almost useless, but faster and smaller.
I’m unsure if they do partial/delta updates (I’ve never looked at the spec in detail), but I also don’t know how well Bloom filters deal with bulk delta updates.
It is absolutely outrageous that you are accusing the authors of this work of dishonesty because of limitations they explicitly called out in their introductory post as well as the paper.
The cuckoo filters they compare against have a similar property, no? (Whether you can update one depends on luck.)
When the table is less full than some cliff-effect threshold, the probability of insertion failure is negligible.
If a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes, so scanning it on every query costs nothing but a single well-predicted branch (a sketch of this idea follows this comment).
If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.
[Of course, all assuming someone can't attack your hash function.]
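To make the stash idea concrete, here is a minimal sketch (my own illustration, not from the paper; `StashedFilter` and `filter_impl` are hypothetical names, and I assume the wrapped filter exposes an `add` that returns False when insertion fails and supports membership tests with `in`):

```python
# Hypothetical wrapper: a cuckoo-filter-like structure plus a stash of
# items whose insertion failed. The stash is almost always empty, so the
# extra check on lookup is a single well-predicted branch.

class StashedFilter:
    def __init__(self, filter_impl):
        self.filter = filter_impl   # assumed: add() -> bool, __contains__
        self.stash = []             # items that could not be placed

    def add(self, item):
        if not self.filter.add(item):   # insertion can fail near full load
            self.stash.append(item)     # fall back to the (tiny) stash

    def __contains__(self, item):
        # Check the filter first; scanning the stash costs essentially
        # nothing because it is empty in almost all runs.
        return item in self.filter or item in self.stash
```

Note that items kept in the stash are stored exactly, so they never contribute false positives; the cost is a few extra bytes in the (usually empty) vector.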
(Co-author of the paper.) You are right. However, it is a bit hard to find the threshold. Even the original author of the cuckoo filter didn't get it right (see https://github.com/efficient/cuckoofilter/issues/9); for this reason we also lowered the maximum load to 0.94. However, I found (running the benchmarks) that sometimes even that is not sufficient: insertion still fails (a simplified sketch of the insert loop illustrating this follows below). So, where is the cliff exactly? Is it 0.9? Or lower? And, as you say, it also depends on the hash function. We used the Murmur3 finalizer, which is statistically very good. What if you use a weaker hash function? For a regular cuckoo hash table, you can just rebuild it. But for a cuckoo filter a rebuild is not easily possible, as you don't have the keys.
You are also right regarding augmenting the cuckoo filter (e.g. "cuckoo hashing with a stash"), but it slows things down. Or use a different insertion algorithm (breadth-first search, I think), but that would add complexity.
I think that, as is, the cuckoo filter is quite a tricky data structure: it can do a lot in theory, but it is hard to get right.
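For readers wondering where insertion failure comes from in the first place, here is a simplified sketch of the standard cuckoo-filter insert loop with partial-key cuckooing (my own illustration, not the paper's or the linked repository's code; the bucket size, kick limit, and hash functions are arbitrary, and the number of buckets should be a power of two so the XOR displacement is reversible):

```python
import random

BUCKET_SIZE = 4     # slots per bucket (illustrative)
MAX_KICKS = 500     # give up after this many evictions (illustrative)

def fingerprint(item):
    return (hash(("fp", item)) & 0xFF) or 1          # 8-bit, never zero

def index_hash(item, num_buckets):
    return hash(("idx", item)) % num_buckets

def alt_index(i, fp, num_buckets):
    # Partial-key cuckooing: the alternate bucket is derived from the
    # current index and the fingerprint alone (num_buckets must be a
    # power of two for this mapping to be reversible).
    return (i ^ hash(("idx", fp))) % num_buckets

def insert(buckets, item):
    num_buckets = len(buckets)
    fp = fingerprint(item)
    i1 = index_hash(item, num_buckets)
    i2 = alt_index(i1, fp, num_buckets)
    for i in (i1, i2):
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    # Both candidate buckets are full: evict a random fingerprint and try
    # to relocate it, repeatedly. Near the load cliff these eviction
    # chains get long and eventually exceed MAX_KICKS.
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):
        victim = random.randrange(len(buckets[i]))
        fp, buckets[i][victim] = buckets[i][victim], fp
        i = alt_index(i, fp, num_buckets)
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    return False    # insertion failed; the filter is effectively full
```

This is the failure being discussed: well below the cliff it essentially never triggers, but the exact load at which it starts to depends on the bucket size, the kick limit, and the hash function, which is why pinning down the cliff is hard.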
It would be fairer to compare with minimal perfect hashing, as that does not support insertion either.