Comment by pllbnk
2 days ago
So, what happened here is they stole all public works ever created and took them for themselves. All copyright laws were ignored without any repercussions, all the litigations of the past (Aaron Swartz for example) have been ignored. And they use it to enrich themselves by re-selling those public works after applying a lossy compression algorithm to prevent them from being exact replicas. If the public hypothetically agreed and allowed this, then those models would all have to become public and be given back to the people.
The idea of "all the public works ever created" is easily contested. Not every work has been "published", let alone scanned, digitised or published to the internet
The marketing for "AI" uses phrases like "the sum of all human knowledge" to refer to what has been used to create "models". The assumed irrelevance of non-published, "private" works is dubious if not absurd
The internet now allows potentially anyone to publish anything, e.g., via personal websites, social media pages, etc. But that doesn't mean everyone partakes. How much of the unfiltered garbage published by those who do has been used to create these "models"?
"AI" companies will not reveal exactly what "works" were used to create the "models"
I'm not commenting above on the question of "fair use" or about the tragedy of Aaron Swartz, I'm commenting on the word "all", i.e., the hype
But if I were going to comment on Swartz I would ask first whether the "AI" models are trained on the contents of JSTOR, or the contents of PACER (that are not being shared on the internet for free)
Otherwise, the comparison is difficult to make, IMHO
For example, with respect to any materials from JSTOR, the "stealing" was done by the pirate library contributors, not the "AI" companies. And with respect to PACER, the "stealing" by Swartz was, technically, done from government computers
If readers are into "above the law" conspiracy theories about "AI" companies, check out the bizarre story of the OpenAI employee who was the document custodian witness for the plaintiffs in the NYTimes copyright litigation. Committed suicide before testifying
> The idea of "all the public works ever created" is easily contested.
Hence the word "public," implying that they are published and accessible.
> The internet now allows potentially anyone to publish anything, e.g., via personal websites, social media pages, etc. But that doesn't mean everyone partakes. How much of the unfiltered garbage published by those who do has been used to create these "models"?
This seems like a nitpick instead of actually responding to the idea that they have stolen massive amounts of other people's work and are using it to enrich themselves. And the stealing is ignored or given a slap-on-the-wrist fine, which is not how it has worked for numerous other people in the past (the example being Aaron Swartz). It's kind of irrelevant whether or not the models train on low-effort text on the internet.
Have you caught these models violating copyright in responses?
Or are you saying that learning is a violation of copyright?
Researchers have. The idea that the data is unrecoverable after training is incorrect.
"Extracting books from production language models
While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement.
We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall).
For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%)."
https://arxiv.org/abs/2601.02671
Learning isn't. Models are not learning; that's just a metaphor, for lack of better words, to describe the process of ingesting data and adjusting weights accordingly.
My point is, they took all this data for free without paying the authors and crammed it into the models. And once it's inside the model the proof of copyright violation disappears.
Torrenting hundreds of thousands of published works to train your AI model is a violation of copyright.