Comment by 6510
4 days ago
I'm not an expert at all, I learn they exist after making something similar to search a few thousand blog posts.
Rather than one hash per file I made a file for each 2 letter combinations like aa.raw, ab.raw, etc where each bit in the file represents a record. (bit 0 is file 0 etc) you could ofc do 3 or 4 letters too.
A query is split into 2 letter combinations. 1st + 2nd, 2nd + 3rd, etc the files are loaded, do a bitwise AND on the files.
a search for "bloom" would only load bl.raw, lo.raw, oo.raw, om.raw
The index is really hard to update but adding new records is easy. New records are first added as false positives until we have enough bits to push a byte to the end of each raw file.
I then got lost pondering what creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if unique characters.
Characters (or other bytes) could be combined like s=z, k=c, x=y or i=e
Calculating which combination is best for the static data set was to hard for me. One could look at the ratio to see if it is worth having a letter combination.
But it works and loading a hand full of files or doing an AND is amazingly fast.
What you've described here is an n-gram inverted index (with n = 2) represented as a bitset. We could call it a bigram bitset inverted index. Glad to know you designed and implemented all of this from first principles, and that it serves you well!
I had a problem where we needed to compare large data sets between machines for keys that existed in both, and the bandwidth cost just wasn’t mathing for the median result set size. I was trying to figure out how to send a fingerprint from machine A to B, then have machine B send the hits back. Or how many round trips I could do based on set size to minimize bandwidth + latency. I ended up with a calculus problem nobody could help me solve because of an n^5 term.
My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.