Comment by tomthe

4 days ago

I wonder if you could implement it with only static hosting?

We would need to split the index into many smaller files that browsers can practically download, maybe 20 MB each. The user types a search query, the browser hashes it and downloads the corresponding index file, which contains only results for that hash. The browser then quickly sifts through that file and returns the results.

Hosting this would be cheap, but the main barriers remain.
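A minimal sketch of what that client side could look like, assuming a JSON bucket-file layout and a fixed bucket count (the bucket count, paths, and entry shape are all illustrative choices, not an existing system):

```ts
// Sketch of the hashed-query static index idea above. The bucket count, the
// file layout, and the shape of an index entry are all assumptions.

interface IndexEntry {
  term: string;     // original (non-hashed) query term
  docIds: number[]; // matching documents, ranked best-first
}

const BUCKET_COUNT = 4096; // e.g. total index size / ~20 MB per file

// Hash a query term to the one statically hosted bucket file that can hold it.
async function bucketFor(term: string): Promise<string> {
  const bytes = new TextEncoder().encode(term.toLowerCase());
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const bucket = new DataView(digest).getUint32(0) % BUCKET_COUNT;
  return `/index/bucket-${bucket}.json`;
}

// Download that single bucket and sift through it in the browser.
async function search(term: string): Promise<number[]> {
  const response = await fetch(await bucketFor(term));
  const entries: IndexEntry[] = await response.json();
  return entries.find((e) => e.term === term.toLowerCase())?.docIds ?? [];
}
```

Multi-term queries would fetch one bucket per term and intersect the lists client-side.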

I've done something similar with a statically hosted site I'm working on. I opted not to reinvent the wheel and just used WASM SQLite in the browser. SQLite already splits the database into fixed-size pages, so a driver that issues HTTP Range Requests can download only the pages a query touches. You just have to build good indexes.

I can even use SQLite's full-text search capabilities!
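For what it's worth, the pattern described here (one statically hosted SQLite file read lazily via HTTP Range Requests from WASM SQLite) is what phiresky's sql.js-httpvfs library does. A rough sketch of the browser side; the database path, chunk size, table name, and the exact `query` signature with bind parameters are assumptions to check against the library's README:

```ts
import { createDbWorker } from "sql.js-httpvfs";

// Worker and wasm files ship with the sql.js-httpvfs package.
const workerUrl = new URL("sql.js-httpvfs/dist/sqlite.worker.js", import.meta.url);
const wasmUrl = new URL("sql.js-httpvfs/dist/sql-wasm.wasm", import.meta.url);

async function searchDocs(query: string) {
  // "full" server mode: a single statically hosted .sqlite3 file, fetched in
  // fixed-size chunks via HTTP Range Requests as pages are needed.
  const worker = await createDbWorker(
    [
      {
        from: "inline",
        config: {
          serverMode: "full",
          url: "/docs.sqlite3",   // placeholder path on the static host
          requestChunkSize: 4096, // should match the database page size
        },
      },
    ],
    workerUrl.toString(),
    wasmUrl.toString()
  );

  // FTS5 full-text search runs entirely in the browser; only the index
  // pages this query touches get downloaded.
  return worker.db.query(
    "SELECT rowid, title FROM docs_fts WHERE docs_fts MATCH ? LIMIT 20",
    [query]
  );
}
```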

  • How would that scale to 10 TB+ of plain text, though? Presumably the indexes would be many gigabytes, especially with full-text search.

    • The client only needs to fetch the index entries for the specific terms it searches for. If the index is just a list of TF-IDF term scores per document (which gets you a very reasonable start on search relevance), some extremely back-of-the-envelope math (see the sketch at the end of this thread) suggests an upper bound in the low tens of megabytes per non-stopword term, which seems doable for a client to download on demand.

  • I wonder if you could take this one step further and make the queries opaque, using homomorphic encryption over the index and then somehow extracting ranges around the document(s) you're interested in.

    Inspired by: "Show HN: Read Wikipedia privately using homomorphic encryption" https://news.ycombinator.com/item?id=31668814
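To make the "low tens of megabytes per term" guess above concrete, here is one set of assumed numbers that lands in that range; every constant is a guess for illustration, not a measurement:

```ts
// Back-of-the-envelope size of one term's posting list for a 10 TB corpus.
// All constants below are assumptions.

const corpusBytes = 10e12;  // 10 TB of plain text
const avgDocBytes = 50e3;   // ~50 KB per document
const docCount = corpusBytes / avgDocBytes; // ~2e8 documents

const docFrequency = 0.02;  // a common non-stopword term appears in ~2% of docs
const postings = docCount * docFrequency;   // ~4e6 (docId, score) pairs

const bytesPerPosting = 6;  // varint doc id + quantized TF-IDF score
const postingListBytes = postings * bytesPerPosting;

console.log(`${(postingListBytes / 1e6).toFixed(0)} MB per term`); // ~24 MB
```

Rarer terms would be far smaller, and stopwords would simply not get posting lists.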