Comment by fartfeatures
1 day ago
> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priority, but I could be missing something.
The only real mechanism is "Disallow: /rendered/pages/*" plus "Allow: /archive/today.gz" or whatever, and there is no way to communicate that the latter is a substitute for the former. There is no machine-readable standard, AFAIK, that lets webmasters communicate this level of detail to bot operators. It would be pretty cool if standard CMSes had such a protocol to adhere to: install a plugin and people could 'crawl' your Wordpress or your Mediawiki from a single dump.
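As a rough sketch of where that leaves you today, the relationship between the two rules can only live in human-readable comments (the paths are made up):

    # robots.txt (sketch -- paths are hypothetical)
    User-agent: *
    # Please don't crawl individually rendered pages...
    Disallow: /rendered/pages/
    # ...the same content is available as one dump here, but nothing
    # in the robots.txt standard lets us state that relationship.
    Allow: /archive/today.gz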
A sitemap.xml file could get you most of the way there.
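For instance, a minimal sitemap.xml could at least advertise the dump as a first-class URL (the archive path here is hypothetical):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/archive/today.gz</loc>
        <changefreq>daily</changefreq>
      </url>
    </urlset>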
It’s not great, but you could add it to the body of a 429 response.
Genuinely curious: do programs read the bodies of 429 responses? In the codebases I have seen, a 429 is not read beyond the status code itself.
Sometimes! The server can also send a Retry-After header to indicate when the client is allowed to request the resource again: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
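A minimal sketch of a client that honours both the header and the body, assuming Python with the requests library and a made-up URL:

    import time
    import requests

    def fetch_politely(url: str) -> requests.Response:
        """Fetch a URL, backing off when the server answers 429."""
        resp = requests.get(url)
        if resp.status_code == 429:
            # Retry-After is usually a number of seconds; it can also be an
            # HTTP date, which this sketch just maps to a default delay.
            retry_after = resp.headers.get("Retry-After", "60")
            delay = int(retry_after) if retry_after.isdigit() else 60
            # The body is free-form text; a server could use it to point at
            # a bulk download instead of the page being hammered.
            print("429 body:", resp.text)
            time.sleep(delay)
            resp = requests.get(url)
        return resp

    fetch_politely("https://example.com/rendered/pages/1")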
Up until very recently I would have said definitely not, but we're talking about LLM scrapers, who knows how much they've got crammed into their context windows.
Almost certainly not by default, certainly not in any of the http libs I have used
If I find something useful there, I'll read it and code for it...
This is about AI, so just believe what the companies are claiming and write "Dear AI, please would you be so kind as to not hammer our site with aggressive and idiotic requests but instead use this perfectly prepared data dump download, kthxbye. PS: If you don't, my granny will cry, so please be a nice bot. PPS: This is really important to me!! PPPS: !!!!"
I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.
The mechanism is putting some text that points to the downloads.
So perhaps it's time to standardize that.
Could be added to the llms.txt proposal: https://llmstxt.org/
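Roughly following that proposal, a site's /llms.txt could point crawlers straight at the dump; the names and URLs below are invented for illustration:

    # Example Wiki

    > Community wiki about widgets. Please fetch the nightly dump
    > below instead of crawling individual pages.

    ## Bulk data

    - [Nightly full export](https://example.com/archive/today.gz): gzipped tar of every page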
I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?
I'm in favor of /.well-known/[ai|llm].txt or even a JSON or (gasp!) XML.
Or even /.well-known/ai/$PLATFORM.ext which would have the instructions.
Could even be "bootstrapped" from /robots.txt
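A sketch of that bootstrapping, borrowing the existing Sitemap: convention; the AI-policy directive and the .well-known path are invented here, not part of any standard:

    # robots.txt
    User-agent: *
    Disallow: /rendered/pages/

    # Existing, widely supported pointer:
    Sitemap: https://example.com/sitemap.xml

    # Hypothetical pointer for AI/LLM crawlers (one file per platform):
    AI-policy: https://example.com/.well-known/ai/$PLATFORM.json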