
Comment by EvgeniyZh

6 months ago

Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

And even if they respect the agreement, using the test set as a validation set can be a huge advantage. That's why "validation set" and "test set" are two different terms with precise meanings.

As for "knowing it's bad", most people won't be able to tell apart a model scoring 25% and one scoring 10%. People who use these models to solve math problems are a tiny share of users and an even tinier share of revenue. What OpenAI needs is to convince investors that capability progress is still happening at a high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

> And even if they respect the agreement, using the test set as a validation set can be a huge advantage. That's why "validation set" and "test set" are two different terms with precise meanings.

This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.
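For readers who haven't internalized why the separation matters, a toy simulation makes the size of the advantage concrete. Nothing below is about OpenAI's actual setup; it only shows the generic selection effect: if you pick the best of many candidates on a benchmark and then report that same benchmark score, the number is inflated even when every candidate is pure chance.

```python
import random

random.seed(0)

# Toy setup: every "candidate model" is a coin with true accuracy 0.5
# on a 100-question benchmark. All numbers here are invented.
N_QUESTIONS = 100
N_CANDIDATES = 50

def score(n_questions):
    # Each question is answered correctly with probability 0.5.
    return sum(random.random() < 0.5 for _ in range(n_questions)) / n_questions

# Test-set-as-validation-set: evaluate many candidates on the benchmark,
# keep the best one, and report its score on that same benchmark.
reported = max(score(N_QUESTIONS) for _ in range(N_CANDIDATES))

# Honest protocol: score a single model once on a fresh, unseen set.
honest = score(N_QUESTIONS)

print(f"reported (selected on benchmark): {reported:.0%}")
print(f"honest (fresh holdout):           {honest:.0%}")
```

With these invented numbers, best-of-50 selection typically reports somewhere in the low 60s percent even though every "model" is a coin flip, while the fresh holdout hovers near 50%. That gap is exactly why validation and test numbers are not interchangeable.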

> People who use these models to solve math problems are a tiny share of users and an even tinier share of revenue.

This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.

> What OpenAI needs is to convince investors that capability progress is still happening at a high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments: investors at this level typically have deep technical expertise and access to extensive testing and validation. And, most damningly for the reductive appeal to incentive structures:

They closed the big round weeks before.

The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

  • > The "everybody does it" argument is a classic rationalization that doesn't actually justify anything.

    I'd argue here the more relevant point is "these specific people have been shown to have done it before."

    > The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

    I think what you're missing is the observation that so very little of that is actually applied in this case. "AI" here is not being treated as an actual science would be. The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

    • > I'd argue here the more relevant point is "these specific people have been shown to have done it before."

      This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

      > "AI" here is not being treated as an actual science would be.

      There's some truth here but also some sleight of hand. Yes, AI development often moves outside traditional academic channels. But you imply this automatically means less rigor, which doesn't follow. Many industry labs have internal review processes, replication requirements, and validation procedures that can be as stringent as academic peer review, or more so. The fact that something isn't in Nature doesn't automatically make it less rigorous.

      > The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

      This combines three questionable implications:

      - That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

      - That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

      - That volume ("pumped out") correlates with quality

      You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

      This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

      - Their own employees.

      - Other researchers who will try to replicate results.

      - Partners integrating their technology.

      - Investors doing technical due diligence.

      - Regulators scrutinizing their claims.

      The idea that they would casually risk all that just to bump up one benchmark number (but not too much! just from 10% to 25%) doesn't align with the actual incentive structure these organizations face.

      Both the original comment and this fall into the same trap - mistaking cynicism for sophistication while actually displaying a somewhat superficial understanding of how modern AI research and development actually operates.


  • > an unsubstantiated claim about widespread misconduct.

    I can't prove it, but I heard it from multiple people in the industry. High contamination levels have been measured for existing benchmarks, though [1, 2]. Whether to read that as the best decontamination anyone can realistically do, as not bothering to decontaminate properly, or as deliberate is up to you.
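    For anyone who hasn't read the cited contamination studies: at their core they measure verbatim n-gram overlap between benchmark items and training corpora. A minimal sketch of that idea (the function names, the n=5 window, and the example strings are mine, not taken from the papers):

```python
def ngrams(text, n=5):
    # Set of all consecutive n-token windows in the text.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_item, train_doc, n=5):
    # Fraction of the test item's n-grams that appear verbatim in the
    # training document; high overlap suggests the item leaked.
    grams = ngrams(test_item, n)
    if not grams:
        return 0.0
    return len(grams & ngrams(train_doc, n)) / len(grams)

problem = "let p be the smallest prime greater than 100 find p squared"
clean_corpus = "a prime number has exactly two positive divisors"
leaked_corpus = ("homework: let p be the smallest prime greater than 100 "
                 "find p squared and report it")

print(overlap_fraction(problem, clean_corpus))   # no shared 5-grams -> 0.0
print(overlap_fraction(problem, leaked_corpus))  # verbatim copy -> 1.0
```

    Real decontamination pipelines are fuzzier (normalization, near-duplicate matching, paraphrase detection), which is exactly why "we decontaminated" spans a wide range of actual rigor.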

    > Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.

    The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate that agreement. Clearly, OpenAI did not plan to use the provided evaluation as a test set, because then they wouldn't have needed access to it. Also, reporting validation numbers as the headline performance metric is not unheard of.

    > This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.

    How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.

    > someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field

    My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.

    [1] https://aclanthology.org/2024.naacl-long.482/

    [2] https://arxiv.org/abs/2412.15194

    • > "I can't prove it, but I heard it from multiple people in the industry"

      The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but they are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of stretching evidence far, far beyond its scope.

      > "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."

      This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

      > "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"

      This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

      > "My credentials are in my profile, not that I think they should matter."

      The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument that OpenAI did something wrong rests on unfalsifiable claims about the industry as a whole and on claimed insider knowledge, while avoiding any verifiable evidence.

      When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.

      This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.
