Comment by pauldix
4 days ago
I believe you could do this effectively with COBS (COmpact Bit Sliced signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...
It's a pretty neat algorithm from a 2019 paper, designed "to index k-mers of DNA samples or q-grams from text documents". You take a collection of Bloom filters, one built per document, and combine them into a single structure that tells you which docs a query term may appear in. Like an inverted index meets a Bloom filter.
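To make the "inverted index meets a Bloom filter" idea concrete, here is a rough Python sketch of a bit-sliced layout. It is not the COBS implementation and skips the paper's compaction entirely. The class name `BitSlicedIndex`, its parameters, and the SHA-256-based hashing are all made up for illustration. Each document gets a Bloom filter of the same size, but the bits are stored row-wise, so row i holds bit i of every document's filter. A query ANDs the rows picked by the term's hashes, and the bits left standing are the candidate docs.

```python
import hashlib


class BitSlicedIndex:
    """Toy bit-sliced signature index: one Bloom filter per document, stored row-wise."""

    def __init__(self, num_docs, num_bits=1024, num_hashes=3):
        self.num_docs = num_docs
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # rows[i] is an integer bitset: bit d is set when document d's
        # Bloom filter has bit position i set.
        self.rows = [0] * num_bits

    def _positions(self, term):
        # Derive num_hashes bit positions by hashing the term with a seed prefix
        # (an illustrative choice, not what COBS uses).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, doc_id, terms):
        if not 0 <= doc_id < self.num_docs:
            raise ValueError("doc_id out of range")
        for term in terms:
            for pos in self._positions(term):
                self.rows[pos] |= 1 << doc_id

    def query(self, term):
        # AND the rows selected by the term's hashes; the surviving bits are
        # the documents whose filters have every one of those positions set.
        candidates = (1 << self.num_docs) - 1
        for pos in self._positions(term):
            candidates &= self.rows[pos]
        return [d for d in range(self.num_docs) if (candidates >> d) & 1]


index = BitSlicedIndex(num_docs=3)
index.add(0, ["ACGT", "CGTA"])
index.add(1, ["GTAC"])
index.add(2, ["ACGT"])
print(index.query("ACGT"))  # always includes 0 and 2; false positives are possible
```

As with any Bloom filter, a hit means "maybe" and a miss means "definitely not", so you'd still verify candidates against the actual documents.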
I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).
There's also code online here: https://github.com/bingmann/cobs