Comment by thunky

2 days ago

Have you caught these models violating copyright in responses?

Or are you saying that learning is a violation of copyright?

Researchers have. The idea that the data is unrecoverable after training is incorrect.

"Extracting books from production language models

While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement.

We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall).

For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%)."

https://arxiv.org/abs/2601.02671
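
To give a rough idea of what a score like nv-recall is getting at, here is a minimal sketch of a block-based overlap measure. This is not the paper's actual definition, which the abstract doesn't spell out. The function name `block_recall`, the word-level tokenization, and the `min_block_words` threshold are all my own assumptions for illustration. It counts what fraction of a reference text is covered by long contiguous runs that also appear, in order, in the model output.

```python
from difflib import SequenceMatcher


def block_recall(reference: str, generated: str, min_block_words: int = 5) -> float:
    """Fraction of reference words covered by long in-order matching blocks.

    Illustrative only: an assumed stand-in for a block-based overlap score,
    not the nv-recall metric as defined in the paper.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0

    # autojunk=False keeps difflib from ignoring very frequent words
    # in long inputs.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)

    # Matching blocks are non-overlapping and in order, so their sizes can
    # be summed. Short blocks are dropped so that common phrases and
    # chance overlaps don't count as extraction.
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block_words
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    book = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "xx the quick brown fox jumps over yy"
    print(f"{block_recall(book, output):.3f}")
```

With a threshold like this, a model that only paraphrases scores near zero, while one that reproduces long passages word for word scores close to 1.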

Learning isn't. But models are not learning. "Learning" is just a metaphor, used for lack of a better word, for the process of ingesting data and adjusting weights accordingly.

My point is, they took all this data for free, without paying the authors, and crammed it into the models. And once it's inside the model, the proof of copyright violation disappears.