Comment by ricardo81
2 months ago
Fetching web pages at the kind of volume needed to keep an index fresh is a problem unless you're Googlebot. It requires manual intervention: getting yourself whitelisted by the likes of Cloudflare, cutting deals with publishers such as Reddit, and building a good reputation with any other bot-blocking software that's unfamiliar with your user agent. Even then, you may still find yourself blocked from critical pieces of information.
No, I think we can get by with CommonCrawl, pulling fresh content every few months and updating the search stubs. The idea is that you don't change the entry points often; you open them up when you need to fetch fresh content.
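A periodic pull like that could start with CommonCrawl's CDX index API, which lets you look up captures for a domain in a given crawl. A minimal sketch, assuming you've picked a crawl ID from the index listing (the one below is just an illustrative example); each returned record points into a WARC file you'd then range-request:

```python
from urllib.parse import urlencode

# Example crawl ID -- in practice, pick the latest crawl from
# https://index.commoncrawl.org/ when you do a refresh pull.
CRAWL_ID = "CC-MAIN-2024-33"
BASE = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def cdx_query_url(domain: str, limit: int = 5) -> str:
    """Build a CDX API query URL for captures under a domain."""
    params = urlencode({
        "url": f"{domain}/*",  # match every page under the domain
        "output": "json",      # one JSON record per line
        "limit": str(limit),
    })
    return f"{BASE}?{params}"

print(cdx_query_url("example.com"))
```

Fetching that URL returns newline-delimited JSON records with fields like the WARC filename, offset, and length, which is what you'd use to pull just the pages you care about rather than whole crawl dumps.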
Imagine this stack: a local LLM, a local search-stub index, and a local code-execution sandbox - a sovereign stack. You can get some privacy and independence back.
CC is not on the same scale as Google and not nearly as fresh. It's around a hundredth of the size, and there's not much chance of it having a recent version of any given page.
I imagine you'd get on just fine for short-tail queries, but the other cases (longer-tail queries, recent ones, things that haven't been crawled) begin to add up.