Comment by rzmmm

1 day ago

The model has multiple layers of mechanisms to prevent carbon copy output of the training data.

Do you have a source for this?

A carbon copy would mean overfitting.

  • I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria and to quote the source code it found verbatim. It claimed in its response to have quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and with no match in the Git history.

    It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.

    • LLMs are not archives of information.

      People seem to have this belief, or perhaps just a general intuition, that LLMs are a Google search over a training set with a fancy language engine on the front end. That's not what they are. The models (almost) self-avoid copyright because they never copy anything in the first place, which is why a model is a dense web of weight connections rather than an orderly bookshelf of copied training data.

      Picture yourself contorting your hands under a spotlight to cast a shadow in the shape of a bird. The bird is not in your fingers, even though the shadow of the bird and the shadow of your hand look very similar. Furthermore, your hand-shadow has no idea what a bird is.

  • Source: just read the definition of what "temperature" is.

    But honestly source = "a knuckle sandwich" would be appropriate here.
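
    For anyone who hasn't looked it up: temperature rescales the model's next-token distribution before sampling, which is one reason the output isn't a deterministic copy of anything. A toy sketch in Python, with made-up logits rather than any real model's internals:

      import math
      import random

      def sample_with_temperature(logits, temperature=1.0):
          # Divide logits by T: T < 1 sharpens the distribution (more
          # deterministic), T > 1 flattens it (more random). As T -> 0
          # this approaches greedy argmax decoding.
          scaled = [l / temperature for l in logits]
          # Softmax with max subtraction for numerical stability.
          m = max(scaled)
          exps = [math.exp(s - m) for s in scaled]
          total = sum(exps)
          probs = [e / total for e in exps]
          return random.choices(range(len(probs)), weights=probs, k=1)[0]

      # Made-up logits for a 4-token vocabulary.
      logits = [2.0, 1.0, 0.5, -1.0]
      print(sample_with_temperature(logits, temperature=0.2))  # almost always 0
      print(sample_with_temperature(logits, temperature=2.0))  # much more varied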

    • Threatening violence*, even in this virtual way and encased in quotation marks, is not allowed here.

      Edit: you've been breaking the site guidelines badly in other threads as well. (To pick one example of many: https://news.ycombinator.com/showhn.html)

      * it would be more accurate to say "using violent language as a trope in an argument" - I don't believe in taking comments like this literally, as if they're really threatening violence. Nonetheless you can't post this way to HN.

Forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt".

  • It's mind-boggling if you think about the fact that they're essentially "just" statistical models.

    It really contextualizes the old wisdom of Pythagoras: that everything can be represented as numbers, and that math is the ultimate truth.

  • That might be somewhat ungenerous unless you have more detail to provide.

    I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction (a toy sketch of the idea follows after this sub-thread).

    • So it would be able to produce the training data, but with sufficient changes or added magic dust to claim it as one's own.

      Legally I think it works, but evidence in a court works differently than evidence in science. It's the same word, but don't let that confuse you, and don't mix the two.
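
    To make the similarity check concrete: the simplest version is an n-gram overlap filter run against an index of the training corpus. A toy sketch, assuming the provider holds the corpus to index; the filters in real products are not public, so this is only the shape of the idea:

      def ngrams(tokens, n=8):
          # All contiguous n-token windows of a token sequence.
          return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

      def build_index(corpus_docs, n=8):
          # Index every n-gram that appears anywhere in the training corpus.
          index = set()
          for doc in corpus_docs:
              index |= ngrams(doc.split(), n)
          return index

      def looks_memorized(output_text, index, n=8, threshold=0.5):
          # Flag output whose n-gram overlap with the corpus exceeds the threshold.
          grams = ngrams(output_text.split(), n)
          if not grams:
              return False
          hits = sum(1 for g in grams if g in index)
          return hits / len(grams) >= threshold

      corpus = ["the quick brown fox jumps over the lazy dog again and again"]
      idx = build_index(corpus)
      print(looks_memorized("quick brown fox jumps over the lazy dog again", idx))   # True
      print(looks_memorized("a completely different sentence about cats right here", idx))  # False

    A hit only means the output is suspiciously close to something in the corpus; what the product does with that signal is a policy choice.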

  • The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

  • Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted.

    • It's not the searching that's infeasible. Efficient algorithms for massive-scale full-text search are available.

      The infeasibility is searching for the (unknown) set of translations that the LLM would put that data through. Even if you posit that the weights are just basic symbolic LUT mappings (they're not), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence.
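
      The membership half really is cheap, though. A toy Bloom filter over token n-grams, assuming you hold the corpus to index (false positives possible, false negatives not), which also shows why a single rewritten token defeats a verbatim match:

        import hashlib

        class BloomFilter:
            # Set membership with possible false positives, no false negatives.
            def __init__(self, size_bits=1 << 20, num_hashes=4):
                self.size = size_bits
                self.num_hashes = num_hashes
                self.bits = bytearray(size_bits // 8)

            def _positions(self, item):
                # Derive k bit positions from salted SHA-256 digests.
                for i in range(self.num_hashes):
                    digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                    yield int.from_bytes(digest[:8], "big") % self.size

            def add(self, item):
                for pos in self._positions(item):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def __contains__(self, item):
                return all(self.bits[pos // 8] & (1 << (pos % 8))
                           for pos in self._positions(item))

        # Index every 8-token window of a (toy) training corpus.
        bf = BloomFilter()
        doc = "the quick brown fox jumps over the lazy dog".split()
        for i in range(len(doc) - 7):
            bf.add(" ".join(doc[i:i + 8]))

        print("quick brown fox jumps over the lazy dog" in bf)  # True: verbatim window
        print("quick brown cat jumps over the lazy dog" in bf)  # False: one token changed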

Does it?

This is a verbatim quote from Gemini 3 Pro, from a chat a couple of days ago:

"Because I have done this exact project on a hot water tank, I can tell you exactly [...]"

I somehow doubt an LLM did that exact project, what with it not having any ability to do plumbing in real life...