Comment by walls

18 hours ago

A huge amount of the web is only crawlable with a googlebot user-agent and specific source IPs.

Are these websites not serving public content? If there are legal concerns, just create a separate scraping LLC that fakes the user agent and uses residential IPs or a VPN or something. I can't imagine the companies would follow through with a lawsuit against a scraper that's trying to index their site to get them more visitors, if they already allow Googlebot.
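For what it's worth, the gating described above is keyed on the User-Agent header plus, for sites that bother, a reverse-DNS check on the source IP, and the IP check is the part a fake header can't get around. A rough sketch of probing just the header side in Python (the Googlebot UA string is the published one; the other crawler identity is made up):

```python
# Compare how a URL responds to a Googlebot User-Agent vs. an unknown crawler.
# Sites that verify Googlebot by reverse DNS on the source IP will still treat
# both of these requests as "not Google", whatever the header says.
import requests

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
GENERIC_UA = "SomeNewCrawler/0.1 (+https://example.com/bot)"  # hypothetical crawler identity

def fetch(url: str, user_agent: str) -> tuple[int, int]:
    """Fetch a URL with the given User-Agent; return (status code, body size in bytes)."""
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return resp.status_code, len(resp.content)

if __name__ == "__main__":
    url = "https://example.com/"  # substitute a UA-gated page to see the difference
    print("as Googlebot:  ", fetch(url, GOOGLEBOT_UA))
    print("as unknown bot:", fetch(url, GENERIC_UA))
```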

> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.

I have always wondered how the Wayback Machine works. Is there no way we could take the Wayback archive and build an index on top of it somehow?
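For reference, the Wayback Machine does expose a public CDX API for enumerating captures, and raw snapshots can be fetched by timestamp, so the plumbing for "an index on top of the archive" could start roughly like this. A sketch only: example.com is a stand-in, and actually parsing and ranking billions of captures is the hard part.

```python
# Enumerate Wayback Machine captures via the CDX API and fetch raw snapshot bodies.
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def list_captures(domain: str, limit: int = 10) -> list[dict]:
    """Return up to `limit` captures for a domain from the Wayback CDX index."""
    params = {
        "url": domain,
        "matchType": "domain",       # include pages under the domain
        "output": "json",
        "filter": "statuscode:200",  # successful captures only
        "collapse": "urlkey",        # one row per unique URL
        "limit": limit,
    }
    rows = requests.get(CDX, params=params, timeout=30).json()
    header, data = rows[0], rows[1:]  # the first row lists the field names
    return [dict(zip(header, row)) for row in data]

def fetch_snapshot(capture: dict) -> str:
    """Fetch the raw archived body for one capture (the id_ flag skips the Wayback banner)."""
    url = f"https://web.archive.org/web/{capture['timestamp']}id_/{capture['original']}"
    return requests.get(url, timeout=30).text

if __name__ == "__main__":
    for cap in list_captures("example.com", limit=3):
        body = fetch_snapshot(cap)
        print(cap["original"], cap["timestamp"], f"{len(body)} chars")
```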

  • You can read https://hackernoon.com/the-long-now-of-the-web-inside-the-in..., which was a nice look into their infrastructure. One could theoretically build it. A few things stand out:

    1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler (a rough sketch of that submission call follows this list).

    2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy because there is literally no other place to go. I don't think they can use Google's syndicated API, because they would then end up with ads in their database, which is garbage that would strain their tiny storage budget.

    3. Minor from a software engineering perspective but important for the company's survival: since they are an archive of record, converting that into an index would need a good legal team to argue the case against Google. The DoJ's recent ruling in their favor could help there.
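  As an aside on point 1, the "Save Page Now" flow itself is just an HTTP request to web.archive.org/save/<url>. A minimal sketch using the anonymous endpoint (the authenticated SPN2 API gives more control, and the exact response shape has changed over time, so treat the header handling as an assumption):

  ```python
  # Ask the Wayback Machine to capture a URL via the anonymous Save Page Now endpoint.
  import requests

  def save_page_now(url: str) -> str:
      """Request a capture of `url`; return the best guess at the archived location."""
      resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
      resp.raise_for_status()
      # The capture location has been exposed via the Content-Location header or
      # via redirects, depending on the era; fall back to the final URL.
      return resp.headers.get("Content-Location", resp.url)

  if __name__ == "__main__":
      print(save_page_now("https://example.com/"))
  ```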

I do not know a lot about this subject, but couldn’t you make a pretty decent index off of Common Crawl? It seems to me the bar is so low that you wouldn’t have to have everything, especially if your goal was not monetization with ads.
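Mechanically, at least, you could: each Common Crawl crawl publishes a queryable URL index, and the underlying WARC records can be pulled by byte range from the public bucket. A rough sketch of the lookup path; the crawl list comes from collinfo.json, so nothing is hard-coded to a particular crawl, but whether the coverage and freshness are good enough is exactly the question raised below.

```python
# Look up captures of a URL in the latest Common Crawl index and fetch their WARC records.
import gzip
import json
import requests

COLLINFO = "https://index.commoncrawl.org/collinfo.json"

def latest_crawl_api() -> str:
    """Return the CDX API endpoint of the most recent crawl (collinfo.json is newest-first)."""
    return requests.get(COLLINFO, timeout=30).json()[0]["cdx-api"]

def lookup(url_pattern: str, limit: int = 5) -> list[dict]:
    """Query the index for captures matching a URL pattern; one dict per capture."""
    api = latest_crawl_api()
    params = {"url": url_pattern, "output": "json", "limit": limit}
    resp = requests.get(api, params=params, timeout=60)
    return [json.loads(line) for line in resp.text.splitlines() if line]

def fetch_warc_record(record: dict) -> bytes:
    """Fetch one capture's gzipped WARC record by byte range and decompress it."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    data_url = "https://data.commoncrawl.org/" + record["filename"]
    resp = requests.get(data_url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    return gzip.decompress(resp.content)

if __name__ == "__main__":
    for rec in lookup("example.com/*", limit=2):
        print(rec["url"], rec["timestamp"], f"{len(fetch_warc_record(rec))} bytes of WARC")
```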

  • I think someone commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start, but I think the key to an index people will actually use is freshness of the results. You need good recall for a search engine; precision tuning/re-ranking is not going to help otherwise.
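  A toy way to see why recall comes first: a re-ranker can only reorder whatever the index hands it, so a page that was never crawled can't be rescued by better scoring. Illustrative only, with made-up documents:

  ```python
  # Minimal inverted index: recall is bounded by what got crawled, not by ranking.
  from collections import defaultdict

  def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
      """Map each term to the set of doc ids containing it."""
      index: dict[str, set[str]] = defaultdict(set)
      for doc_id, text in docs.items():
          for term in text.lower().split():
              index[term].add(doc_id)
      return index

  def search(index: dict[str, set[str]], query: str) -> list[str]:
      """Return candidate docs containing any query term (the recall stage)."""
      candidates: set[str] = set()
      for term in query.lower().split():
          candidates |= index.get(term, set())
      return sorted(candidates)  # a re-ranker would reorder this list, never grow it

  crawled = {
      "d1": "rust web crawler tutorial",
      "d2": "python crawler with asyncio",
      # "d3": "guide to distributed crawling"  <- never crawled, so never retrievable
  }
  index = build_index(crawled)
  print(search(index, "distributed crawling"))  # [] no matter how good the ranker is
  ```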

If a crawler offered enough money, they could be allowed too. It's not like Google has exclusive crawling rights.

  • There is a logistics problem here: even if you had enough money to pay, how would you get in touch with every single site to even let them know you're happy to pay? It's not like site operators routinely scan their error logs to spot your failed crawling attempts and the offer in your user-agent string.

    Even if they do see it, it's a classic chicken & egg problem: it's not worth the site operator's time to engage with your offer until your search engine is popular enough to matter, but your search engine will never become popular enough to matter if it doesn't have a critical mass of sites to begin with.

    • Realistically you don't need every single site on board before your index becomes valuable. You can get in touch with sites via social media, email, Discord, or even by visiting them face to face.
