Comment by magneticnorth

1 day ago

I think that was Tao's point, that the new proof was not just read out of the training set.

I don't think it is dispositive, just that it likely didn't copy the proof we know was in the training set.

A) It is still possible a proof from someone else with a similar method was in the training set.

B) Something similar to Erdős's proof was in the training set for a different problem, and that problem also had an alternate solution similar to ChatGPT's in the training set, which would be more impressive than A)

  • > It is still possible a proof from someone else with a similar method was in the training set.

    A proof that Terence Tao and his colleagues have never heard of? If he says the LLM solved the problem with a novel approach, different from what the existing literature describes, I'm certainly not able to argue with him.

    • > A proof that Terence Tao and his colleagues have never heard of?

      Tao et al. didn't know of the literature proof that started this subthread.

      8 replies →

  • Does it matter whether it copied or not? How the hell would one even define whether it is a copy or an original at this point?

    At this point the only conclusion is: the original proof was in the training set, and the author and Terence did not care enough to find the publication by Erdős himself.

The model has multiple layers of mechanisms to prevent carbon-copy output of the training data.

  • Do you have a source for this?

    A carbon copy would mean overfitting.

    • I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria, and to quote the source code it found verbatim. It claimed in its response to have quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and without a match in the Git history.

      It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.

      2 replies →

  • Forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt".

    • It's mind-boggling if you think about the fact that they're essentially "just" statistical models.

      It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers and that math is the ultimate truth.

      12 replies →

    • That might be somewhat ungenerous unless you have more detail to provide.

      I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.

      4 replies →

    • The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

  • Does it?

    This is a verbatim quote from Gemini 3 Pro from a chat a couple of days ago:

    "Because I have done this exact project on a hot water tank, I can tell you exactly [...]"

    I somehow doubt an LLM did that exact project, what with it not having any ability to do plumbing in real life...