Comment by arjie
1 day ago
Someone convinced me last time[0] that these aren't the well-known scrapers but other actors, and we wouldn't be able to tell either way. I'd like to help the scrapers be better about reading my site, but I get why they aren't.
I wish there were an established protocol for this. Say a $site/.well-known/machine-readable.json that identifies the site as running one of a handful of established software packages, or points to an appropriate dump. I would gladly provide that for LLMs.
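For illustration, it might look something like this. The schema and field names are entirely made up (nothing like it is standardized today), and example.org is just a placeholder:

    {
      "software": "mediawiki",
      "version": "1.41",
      "dumps": [
        {
          "url": "https://example.org/dumps/pages-latest.xml.gz",
          "format": "mediawiki-xml",
          "updated": "nightly"
        }
      ],
      "source": "https://example.org/site.git"
    }

A cooperative crawler could fetch that one file, recognize the software or grab the dump, and skip spidering the HTML entirely.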
Of course this doesn't solve the use case where AI companies are trying to train their models to navigate real-world sites, so I understand it doesn't solve all problems. But one of the things I'd like in the future is my own personal archive of the web as I know it (the Internet Archive is too slow to browse and has very tight rate limits), and I was surprised by how little protocol support there is for robots.
robots.txt is pretty sparse. You can disallow bots and this and that, but what I want to say is "you can get all this data from this git repo" or "here's a dump instead, with how to recreate it". Essentially, cooperating with robots is currently under-specified. I understand why: almost all bots have no incentive to cooperate, so webmasters don't attempt it. But it would be cool to be able to inform the robots appropriately.
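For comparison, roughly everything robots.txt can express today is per-user-agent allow/deny rules plus a couple of de facto extensions, along these lines (example.org is a placeholder):

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Crawl-delay: 10

    Sitemap: https://example.org/sitemap.xml

Crawl-delay isn't even part of the standard and not every crawler honors it, and there's no directive at all that means "the bulk data lives over here instead".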
To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.
As referenced in the article, there absolutely is an alternative.
https://metabrainz.org/datasets
Linked to from the homepage as “datasets”.
I may be too broadly interpreting what you mean by “machine-communicable” in the context of AI scraping though.
Well, imagine the best case: you're a cooperative bot writer who does not intend to harm website owners. Okay, so you follow robots.txt and all that. That's straightforward.
But it's not like you're writing a "metabrainz crawler" and a "metafilter crawler" and a "wiki.roshangeorge.dev crawler". You're presumably trying to write a general Internet crawler. You encounter a site that is clearly an HTTP view into some git repo (say). How do you know to just `git clone` the repo to archive the data, as opposed to browsing the HTTP view?
As you can see, I've got a lot of crawlers on my blog as well, but it's a MediaWiki instance. I'd gladly host a MediaWiki dump for them to take, but then they'd have to know this was a MediaWiki-based site. How do I tell them that? The humans running the program don't know my site exists. Their bot just browses the universe, finds links, and does things.
In the Metabrainz case, it's not like the crawler writer knows Metabrainz even exists. It's probably just linked somewhere in the web the crawler is exploring. There's no "if Metabrainz, do this" anywhere in there.
robots.txt is a bit of a blunt-force instrument, and friendly bot writers should follow it. But assuming they do, there's no way for them to know that "inefficient path A to the data" is the same as "efficient path B to the data" when both are visible to their bot, unless they write a YourSite-specific crawler.
What I want is a way to say "the canonical URL for the data on A is at URL B; you can save us both trouble by just fetching B". In practice, none of this is a problem for me: I cache requests at Cloudflare, and I have MediaWiki caching generated pages, so I can easily weather the bot traffic. But I want to let good bot writers save their own resources. It's not reasonable for me to expect them to write a me-crawler, but if there were a format for specifying the rules, I'd happily be compliant.
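To make that concrete, here's a rough sketch of what a cooperative crawler could do if such a convention existed. Everything here is hypothetical: the .well-known path and the JSON keys are the made-up ones from the example above, and example.org stands in for any site.

    import json
    import urllib.request
    from urllib.parse import urljoin

    # Hypothetical convention: look for a machine-readable hint before spidering HTML.
    HINT_PATH = "/.well-known/machine-readable.json"

    def preferred_source(site_root):
        """Return a dump or repo URL if the site advertises one, else None."""
        try:
            with urllib.request.urlopen(urljoin(site_root, HINT_PATH), timeout=10) as resp:
                hint = json.load(resp)
        except Exception:
            return None  # no hint published; fall back to ordinary crawling
        dumps = hint.get("dumps") or []
        if dumps and dumps[0].get("url"):
            return dumps[0]["url"]
        return hint.get("source")

    if __name__ == "__main__":
        url = preferred_source("https://example.org/")
        if url:
            print("fetch the bulk data from", url)
        else:
            print("no hint published; crawl page by page, respecting robots.txt")

One extra request per site, and a general-purpose crawler would never need a me-specific crawler to find the efficient path.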
Right, yes, I see your point. I was thinking more from the point of view of "using AI to explore and then write custom scrapers where relevant" rather than just blanket scraping. But you're right: at the scale we're talking about, it's presumably just blunt-force "point-and-go" scraping rather than anything more nuanced.
The point you make about having some sort of indicator that scrapers can follow to scrape in an optimal way (or access a dump) makes a lot of sense for people who want their content to be ingested by AI.
> To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.
Why does there have to be a "machine-communicable way"? If these developers cared about such things, they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz".
https://metabrainz.org/datasets
You expect the developers of a crawler to look at every site they crawl and develop a specialized crawler for them? That’s fine if you’re only crawling a handful of sites, but absolutely insane if you’re crawling the entire web.
Isn't the point of AI that it's good at understanding content written for humans? Why can't the scrapers run the homepage through an LLM to detect that?
I'm also not sure why we should be prioritizing the needs of scraper writers over human users and site operators.
If you are crawling the entire web, you should respect robots.txt and not fetch anything disallowed. Full stop.
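For what it's worth, honoring it is cheap. A minimal sketch using Python's standard-library urllib.robotparser (the site and user-agent string are just placeholders):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")
    rp.read()

    # Only fetch a URL if the rules for our user agent allow it.
    if rp.can_fetch("ExampleCrawler/1.0", "https://example.org/wiki/Some_Page"):
        pass  # go ahead and fetch the page
    else:
        pass  # skip it and move on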