Comment by e12e

1 day ago

Seems to gloss over other kinds of contamination beyond GPL code: code from pirated textbooks, the problem of the entire language model being trained on copyrighted data, and the possibility of the training data containing various copyrighted code.

6 comments

embedding-shape 1 day ago

> Code from pirated text books

Anthropic "solved" this by intermingling the texts extracted from pirated books (illegal) with texts extracted from the physical books they bought and destroyed (legal), so no one can clearly say if the copyrighted material it spits out came from a legal source or not. Everyone rejoiced.

thyrsus 14 hours ago

I've seen copyright notices that explicitly forbid use for AI training. Would this "transformation" argument still hold in such cases?
For example:
No Generative AI Training Use
For avoidance of doubt, Author reserves the rights, and grants no rights to, reproduce and/or otherwise use the Work in any manner for purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio including without limitation, technologies that are capable of generating works in the same style or genre as the Work, unless individual or entity obtains Author’s specific and express permission to do so. Nor does any individual or entity have the right to sublicense others to reproduce and/or otherwise use the Work in any manner for the purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio without Author’s specific and express permission.

e12e 21 hours ago

> books they bought and destroyed (legal)

They're only legal if training is fair use, and even then I don't think it's immediately clear what the legal status would be of verbatim regurgitation of code under copyright, or code protected by patents.

AFAIK I (as a human developer) can't just go and copy code out of a textbook, then claim copyright on it and charge for a license to it.

- embedding-shape 21 hours ago
  
  > They're only legal if training is fair use
  The judge seems to have said that what made it legal is that they "transformed" the books in the process (destroying them after digitizing).
  > Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to “conserv[ing] space” through format conversion and found it transformative. - https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
  
senaevren 1 day ago

The intermingling argument is actually central to the Bartz settlement structure. The settlement required destruction of the pirated dataset specifically because commingled training data creates an unresolvable provenance problem. For deployers building on Claude, EDPB Opinion 28/2024 requires a documented assessment of the foundation model's training data legal basis before deployment. "We cannot tell which outputs came from which source" is not a satisfactory answer to a regulator running that assessment. I wrote about it before here: https://legallayer.substack.com/p/i-read-every-edpb-document...