← Back to context

Comment by saurik

6 years ago

Bloom filters absolutely call for linear space if you are using them appropriately: given the hash collision probability you are working with the size of the required filter is linear in the number of elements you want to store. Yes, it is "technically true" (the most useless kind of true) that you can store as much stuff as you want in the filter... but you get increasingly worse false positive rates that eventually just saturate to near 100%. If you actually read any of the math--including this linked work on Xor filters--you will always see them described in terms of "bits per item stored" (where the number of bits is extremely small vs. what you would expect). Hell: Wikipedia even describes the algorithm's storage requirements as "A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.".

I disagree. It requires linear space in the sense that you need another data structure to deal with hash collisions. If you read my comment I was talking about exactly that. In normal applications of bloom filter it's built before a cache layer to ask if your data is in the cache. For this you do not need a list of objects since if bloom filter wrongly determined that object is in the set, it will be a cache miss and as long as a bloom filter operation is faster than a cache miss this is usually a good tradeoff; because normally when you're about to have a cache miss, you end up not going to cache because bloom filter says the object is not there. For this application, you do NOT need O(N) space inside the bloom filter layer, but obviously eventually someone will need to go retrieve this data.

> If you actually read any of the math

I mean I did and I was trying to describe that math above. I built bloom filters many times before. I'm curious if you misread my comment.

  • You said the following, so I assume you agree with it:

    > It requires O(C) memory where C is the size of your largest hash function.

    For any given expected false-positive rate P, C must be chosen to be a size of O(N), so a bloom filter with expected false-positive rate P will be of size O(N).