Comment by simonw
5 days ago
One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt.
They documented it in 2015: https://www.macrumors.com/2015/05/06/applebot-web-crawler-si...
Uncharitable.
Robots.txt is already the understood mechanism for getting robots to avoid scraping a website.
People often target specific user agents in there, which is hard to do if you don't know what the user agent strings are in advance!
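For example (standard robots.txt syntax; "Applebot" is the crawler token Apple documents, per the link above):

    # Block Apple's documented crawler from the whole site,
    # while leaving every other bot unrestricted.
    User-agent: Applebot
    Disallow: /

    # An empty Disallow means nothing is off limits.
    User-agent: *
    Disallow: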
That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.
Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough.
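A purely hypothetical sketch of what that might look like (neither "Crawler-category" nor these category names exist in robots.txt today; they're invented here for illustration):

    # Hypothetical syntax, not part of any standard.
    # Opt out of model training, but keep search indexing
    # and retrieval-at-query-time crawlers welcome.
    Crawler-category: llm-training
    Disallow: /

    Crawler-category: search-indexing
    Allow: /

    Crawler-category: llm-retrieval
    Allow: /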
If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
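Concretely, in standard robots.txt syntax, the catch-all group matches any crawler that has no more specific User-agent group of its own:

    # Applies to every bot without a more specific match,
    # including ones that didn't exist when this was written.
    User-agent: *
    Disallow: /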
Assuming well-behaved robots.