Comment by antirez

3 hours ago

The LLM does not contain a verbatim copy of everything it saw during the pre-training stage. It may remember certain over-represented parts; otherwise, it has knowledge about a huge number of topics, but that knowledge is similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or to the source code of GCC and other compilers, it will implement such a project N times faster.

We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046

  • The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

    • This seems related: it may not be a codebase, but they were able to extract near-verbatim books out of Claude Sonnet.

      https://arxiv.org/pdf/2601.02671

      > For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

    • (I'm not needlessly nitpicking, as I think it matters for this discussion)

      A frontier model (e.g. the latest Gemini or GPT) is likely several to many times larger than 500GB. Even DeepSeek V3 was around 700GB.

      But your overall point still stands, regardless.

    • > The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

      The lesson here is that the Internet compresses pretty well.

  • Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

    Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.
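
    To make that "simple logic" concrete, here is a back-of-envelope sketch; every number in it is an illustrative assumption, not a published figure:

    ```python
    # Capacity argument: could the weights hold a verbatim copy of the training set?
    # All quantities below are rough assumptions used only for illustration.
    training_tokens = 15e12   # assume ~15 trillion training tokens
    bytes_per_token = 4       # assume ~4 bytes of raw text per token
    params = 1e12             # assume a 1-trillion-parameter model
    bytes_per_param = 2       # assume bf16/fp16 weights

    corpus_tb = training_tokens * bytes_per_token / 1e12
    weights_tb = params * bytes_per_param / 1e12

    print(f"raw training text: ~{corpus_tb:.0f} TB")                      # ~60 TB
    print(f"model weights:     ~{weights_tb:.0f} TB")                      # ~2 TB
    print(f"text/weights:      ~{corpus_tb / weights_tb:.0f}x more text")  # ~30x
    ```

    Even with strong text compression (roughly 4-8x for natural language), the corpus stays larger than the weights, and the weights also have to encode everything else the model does, so a verbatim copy of the whole training set cannot fit, even though memorization of over-represented passages still can.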

  • We saw partial copies of large or rare documents, and full copies of smaller, widely reproduced documents, not full copies of everything. A 1-trillion-parameter model, for example, is not a lossless copy of a ten-petabyte slice of plain text from the internet.

    The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

    • Besides, the fact that an LLM may recall parts of certain documents, just as I can recall the incipits of certain novels, does not mean that when you ask the LLM to do some other kind of work, one that is not about recalling things, it will mix those passages in verbatim. The LLM knows what it is doing in a variety of contexts, and uses that knowledge to produce new material. The fact that many people find it bitter that LLMs can do things that replace humans does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has no explanation in terms of memorization of verbatim material. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

    • We are in a clean-room implementation thread here, and verbatim copies of entire works are irrelevant to that topic.

      It is enough to have read even parts of a work for something to be considered a derivative.

      I would also argue that language models, which need gargantuan amounts of training material in order to work, by definition can only output derivative works.

      It does not help that certain people in this thread (not you) edit their comments to backpedal and make the follow-up comments look illogical, but that is in line with their sleazy post-LLM behavior.

    • While I mostly agree with you, it's worth noting that modern LLMs are trained on 10-30T tokens, which is quite comparable to their size (especially given how compressible the data is).
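
      A quick sketch of that ratio, with assumed numbers (parameter count, precision, and token count are all illustrative), showing how many bits of weight capacity are available per training token:

      ```python
      # Weight bits available per training token, under assumed sizes.
      params = 1e12             # assumed parameter count
      bits_per_param = 16       # assumed bf16 storage
      training_tokens = 20e12   # assumed, in the 10-30T range mentioned above

      bits_per_token = params * bits_per_param / training_tokens
      print(f"~{bits_per_token:.1f} weight bits per training token")   # ~0.8 bits
      ```

      Good text compressors need on the order of a few bits per token, so the weight capacity is in the same ballpark as, but still below, what a lossless copy of the training data would require.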