Comment by refulgentis
6 months ago
It's increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involve a benchmark that can be shown to contain any data OpenAI could have accessed, then the results were intentionally faked.
Last time, this confused a bunch of people who didn't understand the difference between test and train data, and it resulted in a particular luminary complaining on Twitter, to much guffawing, about how troubling the situation was.
Literally every comment currently, modulo [1], assumes this and then goes several steps further, and a majority are wildly misusing terms with precise meanings, which explains at least part of the confusion.
[1] i.e. the one saying this is irrelevant because we'll know if the model is bad when it comes out. Which, to be fair, evaluated rationally, doesn't help us with the narrower suspicion that the FrontierMath benchmark results are all invalid because the model trained on (most of) the solutions.
Why wouldn't OpenAI cheat? It's an open secret in the industry that benchmarks are trained on. Everybody does it, so you need to do it too, or else your similarly performing model will look worse on paper.
And even if they respect the agreement, just using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meanings.
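To make the distinction concrete, here is a minimal sketch of the standard discipline, using a generic scikit-learn setup with purely illustrative data and hyperparameters: model selection is done against the validation split, and the test split is scored exactly once at the end. Tuning against the test split instead would bias the reported number upward.

    # Minimal sketch: tune on the validation split, score the test split once.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

    best_C, best_val = None, -1.0
    for C in (0.01, 0.1, 1.0, 10.0):  # model selection uses X_val only
        score = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
        if score > best_val:
            best_C, best_val = C, score

    final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
    print("test accuracy, reported once:", final.score(X_test, y_test))
    # Selecting C against X_test instead of X_val would bias this number upward.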
As for "knowing it's bad", most people won't be able to tell a model scoring 25% and 10% apart. People who are using these models to solve math problems are tiny share of users and even tinier share of revenues. What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
> Why wouldn't OpenAI cheat? It's an open secret in the industry that benchmarks are trained on. Everybody does it, so you need to do it too, or else your similarly performing model will look worse on paper.
This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.
> And even if they respect the agreement, just using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meanings.
This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.
> People who use these models to solve math problems are a tiny share of users and an even tinier share of revenue.
This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.
> What OpenAI needs is to convince investors that capabilities are still progressing at a high pace, and gaming the benchmarks makes perfect sense in that context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments: investors at this level typically have deep technical expertise and access to extensive testing and validation. And, most damningly for the reductive appeal to incentive structure:
They closed the big round weeks before.
The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.
> The "everybody does it" argument is a classic rationalization that doesn't actually justify anything.
I'd argue here the more relevant point is "these specific people have been shown to have done it before."
> The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.
I think what you're missing is the observation that very little of that is actually applied in this case. "AI" here is not being treated the way an actual science would be. The majority of the papers pumped out of these places are not concrete research: they're not submitted to journals and not peer reviewed.
> an unsubstantiated claim about widespread misconduct.
I can't prove it, but I've heard it from multiple people in the industry. High contamination levels have been documented for existing benchmarks, though [1, 2]. Whether you believe that contamination is simply the best anyone can do, the result of not doing the best possible decontamination, or deliberate is up to you.
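For what "contamination" means operationally, here is a rough, illustrative sketch of the kind of n-gram overlap check contamination studies rely on (not the exact method of [1] or [2]): flag a benchmark item if a long token n-gram from it also appears in the training corpus.

    # Illustrative n-gram overlap check; real decontamination pipelines differ
    # in tokenization, choice of n, and fuzziness.
    def ngrams(text, n=13):
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def looks_contaminated(benchmark_item, corpus_ngrams, n=13):
        # corpus_ngrams: a set built by streaming the training data once
        return bool(ngrams(benchmark_item, n) & corpus_ngrams)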
> Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.
The verbal agreement was a promise not to train on the evaluation set. Using it as a validation set would not violate that agreement. Clearly, OpenAI did not plan to use the provided evaluation as a test set, because then they wouldn't need access to it at all. Also, reporting validation numbers as the performance metric is not unheard of.
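To illustrate the advantage (a hypothetical simulation, not a claim about what actually happened): even with zero training on the problems, picking the best of N candidate models or checkpoints by their score on the eval set inflates the reported number relative to a single blind run.

    # Hypothetical: 20 checkpoints with the same true 15% solve rate on 100 problems.
    import random
    random.seed(0)

    def measured_score(true_rate=0.15, n_problems=100):
        return sum(random.random() < true_rate for _ in range(n_problems)) / n_problems

    blind = measured_score()                             # test-set discipline: one shot
    selected = max(measured_score() for _ in range(20))  # eval set used for model selection
    print(f"blind run: {blind:.2f}, best-of-20 selected on the eval: {selected:.2f}")
    # Selection noise alone pushes the selected number above the true 0.15.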
> This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.
How good a proxy is it? There is some correlation, but can you say anything quantitative? Do you think you could predict which models perform better on math benchmarks based on interacting with them? Especially for a benchmark you have no access to and can't solve yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.
> someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field
My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.
[1] https://aclanthology.org/2024.naacl-long.482/
[2] https://arxiv.org/abs/2412.15194