Comment by magneticnorth

1 day ago

I think that was Tao's point, that the new proof was not just read out of the training set.

I don't think it is dispositive, just that it likely didn't copy the proof we know was in the training set.

A) It is still possible a proof from someone else with a similar method was in the training set.

B) Something similar to Erdős's proof was in the training set for a different problem, and that problem also had an alternate solution similar to ChatGPT's in the training set, which would be more impressive than A)

  • > It is still possible a proof from someone else with a similar method was in the training set.

    A proof that Terence Tao and his colleagues have never heard of? If he says the LLM solved the problem with a novel approach, different from what the existing literature describes, I'm certainly not able to argue with him.

    • > A proof that Terence Tao and his colleagues have never heard of?

      Tao et al. didn't know of the literature proof that started this subthread.

      8 replies →

  • Does it matter whether it copied or not? How the hell would one even define whether it is a copy or an original at this point?

    At this point the only conclusion is: the original proof was in the training set, and the author and Terence did not care enough to find the publication by Erdős himself.

The model has multiple layers of mechanisms to prevent carbon-copy output of the training data.

  • Do you have a source for this?

    A carbon copy would mean overfitting.

    • I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria, and to quote the source code it found verbatim. It claimed in its response to have quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and without a match in the Git history.

      It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.

      2 replies →

  • Forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt".

    • It's mind-boggling if you think about the fact that they're essentially "just" statistical models.

      It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers and that math is the ultimate truth.

      12 replies →

    • That might be somewhat ungenerous unless you have more detail to provide.

      I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.

      4 replies →

    • The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

  • Does it?

    This is a verbatim quote from Gemini 3 Pro from a chat a couple of days ago:

    "Because I have done this exact project on a hot water tank, I can tell you exactly [...]"

    I somehow doubt an LLM did that exact project, what with it not having any ability to do plumbing in real life...