Comment by squigz

1 day ago

The mechanism is putting some text on the page that points to the downloads.

So perhaps it's time to standardize that.

  • I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?

    • There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.

      This also means that right now, it could be much easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick.

      --

      [0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's been standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.

      [1] - Not to be confused with having an LLM scrape a specific page because a user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the "User-Agent" header's name).


  • I'm in favor of /.well-known/[ai|llm].txt, or even a JSON or (gasp!) XML file.

    Or even /.well-known/ai/$PLATFORM.ext which would have the instructions.

    Could even be "bootstrapped" from /robots.txt
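To make the idea concrete, here is a minimal sketch of what parsing such a well-known manifest could look like. The file format is purely hypothetical (no such standard exists today): it assumes a `robots.txt`-style line format where each `Dump:` line maps a path prefix to a bulk-download URL, so a crawler could fetch the dump instead of scraping the pages under that prefix.

```python
# Sketch of a parser for a hypothetical /.well-known/ai.txt manifest.
# The "Dump: <prefix> <url>" line format is an assumption, not a real
# standard; it mirrors the directive style of robots.txt.

from dataclasses import dataclass


@dataclass
class DumpEntry:
    prefix: str  # URL path prefix the dump covers, e.g. "/wiki/"
    url: str     # where to download the bulk dump


def parse_manifest(text: str) -> list[DumpEntry]:
    """Extract 'Dump:' directives; ignore comments and unknown lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or not line.lower().startswith("dump:"):
            continue
        parts = line[len("dump:"):].split()
        if len(parts) == 2:
            entries.append(DumpEntry(prefix=parts[0], url=parts[1]))
    return entries


sample = """\
# Bulk data dumps -- please fetch these instead of crawling.
Dump: /wiki/ https://example.com/dumps/wiki-latest.tar.gz
Dump: /forum/ https://example.com/dumps/forum-2024-06.jsonl.gz
"""

for entry in parse_manifest(sample):
    print(entry.prefix, entry.url)
```

The bootstrapping idea from the parent comment would then just be one extra line in /robots.txt (say, a pointer like `Well-known-ai: /.well-known/ai.txt` -- again, hypothetical) so existing robots.txt consumers can discover the manifest.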