Comment by saaaaaam
1 day ago
As referenced in the article, there absolutely is an alternative.
https://metabrainz.org/datasets
Linked to from the homepage as “datasets”.
I may be interpreting what you mean by “machine-communicable” in the context of AI scraping too broadly, though.
Well, imagine the best case: you're a cooperative bot writer who doesn't intend to harm website owners. Okay, so you follow robots.txt and all that. That's straightforward.
But it's not like you're writing a "metabrainz crawler" and a "metafilter crawler" and a "wiki.roshangeorge.dev crawler". You're presumably trying to write a general Internet crawler. You encounter a site that is clearly an HTTP view into some git repo (say). How do you know to just `git clone` the repo to get the data archived, rather than just browsing the HTTP view?
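For what it's worth, Git is one of the few cases where a generic probe even exists: the smart HTTP protocol answers a well-known ref-advertisement request. But that only helps once the crawler already has a candidate clone URL, which is exactly the part it can't guess from an arbitrary HTML view. A rough sketch in Python (the URL is purely illustrative):

```python
# Rough sketch, not a real crawler: check whether a candidate URL speaks
# Git's smart HTTP protocol by requesting the ref advertisement.
import urllib.request


def looks_like_git_repo(candidate_clone_url: str) -> bool:
    """Return True if the URL answers a git-upload-pack ref advertisement."""
    probe = candidate_clone_url.rstrip("/") + "/info/refs?service=git-upload-pack"
    req = urllib.request.Request(probe, headers={"User-Agent": "friendly-crawler"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            content_type = resp.headers.get("Content-Type", "")
    except Exception:
        return False  # unreachable, or not a smart-HTTP git server
    return content_type.startswith("application/x-git-upload-pack-advertisement")


# Example (illustrative URL only):
# looks_like_git_repo("https://example.org/some/repo.git")
```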
As you can see, I've got a lot of crawlers on my blog as well, but it's a MediaWiki instance. I'd gladly host a MediaWiki dump for them to take, but then they'd have to know this was a MediaWiki-based site. How do I tell them that? The humans running the program don't know my site exists. Their bot just browses the universe and finds links and does things.
In the MetaBrainz case, it's not like the crawler writer knows MetaBrainz even exists. It's probably just linked somewhere in the web the crawler is exploring. There's no "if MetaBrainz, do this" anywhere in there.
robots.txt is a bit of a blunt-force instrument, and friendly bot writers should follow it. But assuming they do, there's no way for them to know that "inefficient path A to data" is the same as "efficient path B to data" if both are visible to their bot, unless they write a YourSite-specific crawler.
What I want is a way to say "the canonical URL for the data on A is at URL B; you can save us both trouble by just fetching B". In practice, none of this is a problem for me. I cache requests at Cloudflare, and I have MediaWiki caching generated pages, so I can easily weather the bot traffic. But I want to enable good bot writers to save their own resources. It's not reasonable for me to expect them to write a me-crawler, but if there is a format for specifying the rules, I'm happy to be compliant.
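To make that concrete: as far as I know nothing like this exists as a standard, but imagine a well-known manifest that a cooperative crawler checks before walking the HTML view. The path (`/.well-known/datasets.json`) and the field names in this sketch are entirely made up:

```python
# Hypothetical sketch: "/.well-known/datasets.json" and its fields are
# invented for illustration; they are not an existing standard.
import json
import urllib.request
from urllib.parse import urljoin


def advertised_dump_url(site_root: str) -> str | None:
    """Return a bulk-download URL the site advertises, if any."""
    probe = urljoin(site_root, "/.well-known/datasets.json")
    try:
        with urllib.request.urlopen(probe, timeout=10) as resp:
            manifest = json.load(resp)
    except Exception:
        return None  # no manifest: fall back to ordinary crawling
    if not isinstance(manifest, dict):
        return None
    # Imagined manifest shape:
    # {"dumps": [{"format": "mediawiki-xml", "url": "https://.../dump.xml.gz"}]}
    for dump in manifest.get("dumps", []):
        if isinstance(dump, dict) and "url" in dump:
            return dump["url"]
    return None
```

A crawler that finds the manifest fetches the dump once instead of requesting every wiki page; one that doesn't just falls back to whatever it was doing before.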
Right, yes, I see your point. I was thinking more from the point of view of "using AI to explore and then write custom scrapers where relevant" rather than just blanket scraping. But you're right - at the scale we're talking about, it's presumably just blunt-force "point-and-go" scraping, rather than anything more nuanced.
The point you make about having some sort of indicator that scrapers can follow to scrape in an optimal way (or access a dump) makes a lot of sense for people who want their content to be ingested by AI.