Comment by simonw

6 days ago

One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt.

Uncharitable.

Robots.txt is already the well-understood mechanism for telling robots not to scrape a website.

  • People often target specific user agents in robots.txt, which is hard if you don't know what those user agents are in advance!

    • That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.

      Wanting to disallow LLM training (or, optionally, only training of closed-weight models) while still encouraging search indexing, or even LLM retrieval in response to user queries, seems like a popular enough desire. (A rough sketch of the idea follows at the end of this thread.)

    • If you target a specific user agent, then you're saying "I want this specific user agent to follow this rule, and no others." Don't be surprised when a new bot does exactly what you said! If you don't want any bots reading something, use a wildcard (see the quick check below).
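
To make the wildcard point concrete, here is a quick check with Python's standard-library robots.txt parser. The crawler names and URL are invented for illustration; it just shows how robots.txt group matching behaves.

    # Check robots.txt group matching with Python's standard-library parser.
    # Crawler names and the URL below are invented for illustration.
    from urllib.robotparser import RobotFileParser

    # Rules that name only one specific crawler.
    specific_rules = [
        "User-agent: KnownTrainingBot",
        "Disallow: /",
    ]
    rp = RobotFileParser()
    rp.parse(specific_rules)
    print(rp.can_fetch("KnownTrainingBot", "https://example.com/page"))  # False
    # A bot that didn't exist when the file was written matches no group,
    # so by default it is allowed.
    print(rp.can_fetch("BrandNewBot", "https://example.com/page"))       # True

    # A wildcard group applies to every crawler, known or not.
    wildcard_rules = [
        "User-agent: *",
        "Disallow: /",
    ]
    rp = RobotFileParser()
    rp.parse(wildcard_rules)
    print(rp.can_fetch("BrandNewBot", "https://example.com/page"))       # False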

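And a rough sketch of what the "crawler categories" idea might look like in a robots.txt file, if something like it were ever standardized. None of these directives or category names exist today, so real crawlers would ignore them; this is purely illustrative.

    # Hypothetical syntax only; no crawler recognizes these directives today.
    User-agent-category: llm-training
    Disallow: /

    User-agent-category: llm-retrieval
    Allow: /

    User-agent-category: search-indexing
    Allow: /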