Comment by TeMPOraL

1 day ago

There isn't. There never was one, because the vast majority of websites are actually selfish with respect to data, even when that's entirely pointless. You can see this even here, in how some people complain that LLMs made them stop writing their blogs: turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.

Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. This means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per-page anyway.

> turns out plenty of people say they write for others to read

LLMs are not people. Bloggers don't write so that a company can profit from their writing by training LLMs on it; they write for others to read their ideas.

  • LLMs aren't making their owners money by just idling on datacenters' worth of GPUs. They're making money by being useful to users who pay for access. The knowledge and insights from the writings that go into training data end up being read by people directly, and also inform even more useful output and work benefiting even more people.

    • Except the output coming from an LLM is the LLM's take on it, not the original source material. It's not the same thing. Not all writing is simply a collection of facts.

> turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.

I couldn't care less about "tracking and controlling the audience," but I have no interest in others using my words and photos to profit from slop generators. I make that clear in robots.txt and licenses, but they ignore both.