Comment by visarga
2 months ago
No, I think we can get by with CommonCrawl, pulling fresh content every few months and updating the search stubs. The idea is that you don't change the entry points often; you only open them up when you need fresh content.
Imagine this stack: a local LLM, a local search stub index, and a local code execution sandbox - a sovereign stack. You get some privacy and independence back.
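For concreteness, here's a toy sketch of what a local "search stub index" could look like - a hypothetical inverted index mapping terms to page stubs, not CommonCrawl's actual format or any real tool (names like `SearchStubIndex` and the example URLs are made up for illustration):

```python
from collections import defaultdict

class SearchStubIndex:
    """Toy inverted index: maps terms to the pages containing them.
    A real build would ingest CommonCrawl's extracted-text dumps instead."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of page URLs
        self.pages = {}                   # URL -> cached snippet ("stub")

    def add(self, url, text):
        self.pages[url] = text[:200]      # keep a short stub, not the full page
        for term in set(text.lower().split()):
            self.postings[term].add(url)

    def search(self, query):
        # Simple AND query: every term must appear on the page.
        terms = query.lower().split()
        if not terms or any(t not in self.postings for t in terms):
            return []
        hits = set.intersection(*(self.postings[t] for t in terms))
        return sorted(hits)

idx = SearchStubIndex()
idx.add("https://example.com/a", "Local LLM inference on consumer hardware")
idx.add("https://example.com/b", "Sandboxed code execution for agents")
print(idx.search("local llm"))  # -> ['https://example.com/a']
```

The stub index answers queries offline; only when a query misses (or the stubs go stale) would you open the entry points and refresh from a new crawl.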
CC is not on the same scale as Google, and not nearly as fresh: it's around a hundredth of the size, with little chance of having a recent version of any given page.
I imagine you'd get on just fine for short-tail queries, but the other cases (longer-tail queries, recent events, pages that haven't been crawled) begin to add up.