Comment by sigmar
8 days ago
Agree with this. Strange to me to frame the "training recall" as cheating (33 of the 38 cheating instances). Most people think of "cheating" as breaking rules. How is an LLM supposed to not use what was put into its weights?
While I probably wouldn't classify it as cheating, it is an even bigger signal of concern for model quality.
Cheating by breaking the rules at least implies some learned patterns.
Repeating training data verbatim for narrow cases like this implies that the model is overfitting.
If we're evaluating a person, rote recall is not necessarily cheating. It's expected, but then you'd expect them to apply that rote-memorized information in a novel way later on and prove they understand how they applied their priors to the new situation.
Models don't actually reason in the same sense, so recalling by rote from their training data is "cheating" in the sense that the training data cheated, not the model. So many of those benchmarks have snaked their way into training data that they have become less useful as benchmarks. That, I think, is going to be a long-term difficulty in quantitatively assessing model quality and "intelligence." So it is cheating, in the sense of what we expect from the models and training data, but not in a human sense.
Memorization is NOT problem-solving ability, and many people care about the latter.
By writing a not-identical, but valid, solution? Any modestly complex engineering problem has many solutions.
This is an obvious example of why LLM training is so different from human learning.
I expect any well-informed corporate lawyer who has thought about this carefully is strongly advising that these tools not be used. When the LLM [0] barfs up some nontrivial code that's covered by the AGPL and your company's devs put it into the company's "all rights reserved" codebase (entirely unaware of its provenance), it's going to be a nightmare to come back from that.
[0] ...that Nvidia's CEO says they should be spending 50% of a senior dev's salary per seat per year on...
The ship sailed on this a long time ago.
I mean, people expect a model to give a working solution. They also expect it to provide that solution in as few tokens as possible (input/output). They might expect it to come up with an original solution, but I don't think most people would compromise on the first two points.