Comment by simonw
5 days ago
One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt.
They documented it in 2015: https://www.macrumors.com/2015/05/06/applebot-web-crawler-si...
Uncharitable.
Robots.txt is already the understood mechanism for getting robots to avoid scraping a website.
People often target specific user agents in there, which is hard to do if you don't know what the user agent strings are in advance!
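For example (standard robots.txt syntax; "Applebot" is the crawler token Apple documents, per the link above):

    # Block Apple's documented crawler from the whole site,
    # while leaving every other bot unrestricted.
    User-agent: Applebot
    Disallow: /

    # An empty Disallow means nothing is off limits.
    User-agent: *
    Disallow: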
That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.
Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough.
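A purely hypothetical sketch of what that might look like (neither "Crawler-category" nor these category names exist in robots.txt today; they're invented here for illustration):

    # Hypothetical syntax, not part of any standard.
    # Opt out of model training, but keep search indexing
    # and retrieval-at-query-time crawlers welcome.
    Crawler-category: llm-training
    Disallow: /

    Crawler-category: search-indexing
    Allow: /

    Crawler-category: llm-retrieval
    Allow: /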
If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
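Concretely, in standard robots.txt syntax, the catch-all group matches any crawler that has no more specific User-agent group of its own:

    # Applies to every bot without a more specific match,
    # including ones that didn't exist when this was written.
    User-agent: *
    Disallow: /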
Assuming well-behaved robots.