Comment by senko
15 hours ago
A full up-to-date index of the searchable web should be a public commons good.
This would not only allow better competition in search, but fix the "AI scrapers" problem: No need to scrape if the data has already been scraped.
Crawling is technically a solved problem, as witnessed by everyone and their dog seemingly crawling everything. If those efforts were pooled, crawling would be cheaper and less resource-intensive.
The secret sauce is in what happens afterwards, anyway.
Here's the idea in more detail: https://senkorasic.com/articles/ai-scraper-tragedy-commons
I'm under no illusion that something like that will happen... but it could.
Isn't this what CommonCrawl are doing?
https://commoncrawl.org/
Yes. But they don't crawl everything (probably due to lack of funding), and, as the article and other commenters here note, people are incentivised to allow Google and only Google to crawl. In practice, the CommonCrawl dataset is too small for a realistic search engine competitor.
I'd love to see Google, Bing and others being incentivized (wink, wink) to contribute (technically, financially, etc.) to CommonCrawl or Internet Archive, since they already crawl the web themselves.
Is crawling really solved?
Any naive crawler is going to run into the problem that servers can give different responses to different clients, which means a site can show the crawler something different to what it shows real users. That turns crawling into an antagonistic problem where the crawler developers need to continually be on the lookout for new ways servers can poison or mislead the index.
Otherwise you'll return junk spam results from spammers that lied to the crawler.
I've never done it, so maybe it's easier than I imagine, but I wouldn't be quick to assume that crawling is solved.
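To make the "different responses to different clients" problem concrete, here is a minimal sketch of one way a crawler might flag it: fetch the same URL once as a crawler and once as a browser, then compare the two bodies. The function name `looks_cloaked` and the 0.8 threshold are assumptions for illustration, not anything a real search engine is known to use, and the fetching itself is left out.

```python
from difflib import SequenceMatcher


def looks_cloaked(body_as_crawler: str, body_as_browser: str,
                  threshold: float = 0.8) -> bool:
    """Flag a page whose content differs a lot between two client identities.

    Hypothetical helper: both bodies are assumed to have been fetched
    elsewhere, from the same URL, with different User-Agent headers.
    """
    # Compare word sequences rather than characters so that small
    # markup or whitespace differences matter less.
    crawler_words = body_as_crawler.split()
    browser_words = body_as_browser.split()
    similarity = SequenceMatcher(None, crawler_words, browser_words).ratio()
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return similarity < threshold


honest_page = "Cheap flights to Zagreb, updated daily"
spam_page = "Buy pills now casino jackpot"

print(looks_cloaked(honest_page, honest_page))
print(looks_cloaked(honest_page, spam_page))
```

A check like this is easy to defeat (a server can key on IP ranges instead of the User-Agent header, and legitimately dynamic pages will differ between fetches), which is the antagonistic part.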
I don't mean to say it's trivial. I'm sure there are many hard problems such as the one you mention - though that particular one is more part of "cleaning the index", which might work on top of the open common corpus.
But my impression is that it's more a question of scale and engineering time than having to invent something new.
(disclaimer: I've also never worked on an internet-scale search system, so maybe I'm way off base here as well).
Oh, ok. I misunderstood - I think we agree.