Comment by ThatPlayer
3 days ago
I've done something similar with a statically hosted site I'm working on. I opted not to reinvent the wheel and just use WASM SQLite in the browser. SQLite already splits the database into fixed-size pages, so a driver that uses HTTP Range Requests can download only the required pages. You just have to make good indexes.
I can even use SQLite's full-text search capabilities!
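A minimal sketch of that setup using phiresky's sql.js-httpvfs, which wraps WASM SQLite with a virtual file system backed by HTTP Range Requests. The database URL, table names, and chunk size here are illustrative, and the exact API may differ between versions:

```ts
import { createDbWorker } from "sql.js-httpvfs";

// Bundler-resolved URLs for the worker script and the SQLite WASM binary.
const workerUrl = new URL(
  "sql.js-httpvfs/dist/sqlite.worker.js",
  import.meta.url
);
const wasmUrl = new URL("sql.js-httpvfs/dist/sql-wasm.wasm", import.meta.url);

// Inside an ES module (top-level await). The chunk size should match the
// page_size the database file was built with, so each Range Request maps
// onto whole SQLite pages.
const worker = await createDbWorker(
  [
    {
      from: "inline",
      config: {
        serverMode: "full",             // one plain .sqlite3 file on static hosting
        url: "/data/articles.sqlite3",  // illustrative path
        requestChunkSize: 4096,         // should equal PRAGMA page_size
      },
    },
  ],
  workerUrl.toString(),
  wasmUrl.toString()
);

// A point lookup touches only the index and table pages it needs...
const row = await worker.db.query(
  "SELECT title, body FROM articles WHERE id = 42"
);

// ...and an FTS5 MATCH query works the same way, pulling in the
// full-text index pages on demand.
const hits = await worker.db.query(
  "SELECT title FROM articles_fts WHERE articles_fts MATCH 'range requests' LIMIT 10"
);
```

Building the database ahead of time with a matching page size and the right indexes is what keeps each query down to a handful of Range Requests.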
How would that scale to 10 TB+ of plain text, though? Presumably the indexes would be many gigabytes, especially with full-text search.
The client only needs to fetch the index entries for the specific search terms; if the index is just a list of TF-IDF term scores per document (which gets you a very reasonable start on search relevance), some extremely back-of-the-envelope math leads me to guess at an upper bound in the low tens of megabytes per (non-stopword) term, which seems doable for a client to download on demand.
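For what it's worth, here is one set of assumptions under which that estimate lands around 10 MB per term; every number in this sketch is an illustrative guess, not a measurement:

```ts
// Back-of-the-envelope sizing for one term's posting list.
// All inputs are assumptions, purely for illustration.

const corpusBytes = 10e12;                    // 10 TB of plain text
const avgDocBytes = 1e6;                      // assume ~1 MB per document
const docCount = corpusBytes / avgDocBytes;   // ~10 million documents

// Assume a fairly common (but non-stopword) term appears in ~10% of docs.
const docsWithTerm = docCount * 0.1;          // ~1 million postings

// Per posting: a doc id plus a quantized TF-IDF score.
const bytesPerPosting = 8 /* doc id */ + 2 /* score */;

const postingListBytes = docsWithTerm * bytesPerPosting;
console.log(`${(postingListBytes / 1e6).toFixed(1)} MB per term`); // ~10 MB

// Delta-encoding the doc ids and compressing would typically shrink this
// further, so "low tens of MB" looks like a sane upper bound under these
// assumptions; rarer terms would be far smaller.
```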
I wonder if you could take this one step further and make the queries opaque by running homomorphic encryption over the index, then somehow extract the byte ranges around the document(s) you're interested in.
Inspired by: "Show HN: Read Wikipedia privately using homomorphic encryption" https://news.ycombinator.com/item?id=31668814
Super interesting.