Comment by wat10000

6 days ago

If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
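To make the distinction above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names `OldBot` and `NewBot` and the example URL are hypothetical, chosen only for illustration; real crawlers may match user agents differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule aimed at one named crawler only (hypothetical name).
NAMED_ONLY = """\
User-agent: OldBot
Disallow: /
"""

# Rule aimed at every crawler via the wildcard.
WILDCARD = """\
User-agent: *
Disallow: /
"""

URL = "https://example.com/private/page.html"  # hypothetical URL

# The named rule blocks the crawler it names...
named_blocks_old = not can_fetch(NAMED_ONLY, "OldBot", URL)
# ...but says nothing about a crawler that did not exist when it was written.
named_blocks_new = not can_fetch(NAMED_ONLY, "NewBot", URL)
# The wildcard rule covers the new crawler as well.
wildcard_blocks_new = not can_fetch(WILDCARD, "NewBot", URL)

print(named_blocks_old, named_blocks_new, wildcard_blocks_new)
```

Under these assumptions, the named rule stops only `OldBot`, while the wildcard rule also stops a crawler nobody had heard of when the file was written.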

Yes, but given the lack of generic "robot types" (e.g. "allow algorithmic search crawlers, allow archival, deny LLM training crawlers"), neither opt-in nor opt-out seems like a particularly great option in an age when new crawlers are appearing rapidly (and are often, as here, announced only after the fact).

Sure, but I still think it's OK to look at Apple with a raised eyebrow when they say "and our previously secret training data crawler obeys robots.txt so you can always opt out!"

  • I've been online since before the web existed, and this is the first time I've ever seen this idea of some implicit obligation to give people advance notice before you deploy a crawler. Looks to me like people are making up new rules on the fly because they don't like Apple and/or LLMs.

    • I stand by what I said.

      Apple are saying you can opt out of their training data collection using robots.txt.

      But... they collected their training data before they told people how to opt out.

      I don't understand why me pointing that out as "eyebrow raising" is controversial here.
