Comment by refulgentis

6 months ago

> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

> And even if they respect the agreement, even using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meaning.

This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.
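
To make the separation concrete: even without any training on a held-out set, merely using it to pick among candidate models inflates the reported number. Here's a minimal, purely illustrative toy simulation in Python (not anyone's actual evaluation pipeline) showing the effect:

```python
import random

random.seed(0)

# Toy setup: 300 held-out problems, 50 candidate checkpoints, all with the
# same true skill (each solves any given problem with probability 0.2).
N_PROBLEMS, N_CANDIDATES, TRUE_SKILL = 300, 50, 0.20

def measured_score():
    """Fraction of the held-out problems one candidate happens to solve."""
    return sum(random.random() < TRUE_SKILL for _ in range(N_PROBLEMS)) / N_PROBLEMS

scores = [measured_score() for _ in range(N_CANDIDATES)]

# Honest protocol: commit to one candidate in advance, evaluate it once.
print(f"pre-committed candidate: {scores[0]:.1%}")

# "Test set as validation set": score every candidate on the held-out data
# and report the best. No training ever touches the set, yet the reported
# number sits systematically above the true 20% skill.
print(f"best of {N_CANDIDATES} candidates: {max(scores):.1%}")
```

The gap between the two printed numbers is pure selection effect, which is exactly what keeping the test set untouched until a single final run is meant to prevent.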

> People who are using these models to solve math problems are a tiny share of users and an even tinier share of revenue.

This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.

> What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments - investors at this level typically have deep technical expertise and access to extensive testing and validation. And, most damningly for the reductive appeal to incentive structure:

They closed the big round weeks before.

The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

> The "everybody does it" argument is a classic rationalization that doesn't actually justify anything.

I'd argue here the more relevant point is "these specific people have been shown to have done it before."

> The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

I think what you're missing is the observation that so very little of that is actually applied in this case. "AI" here is not being treated as an actual science would be. The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

  • > I'd argue here the more relevant point is "these specific people have been shown to have done it before."

    This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

    > "AI" here is not being treated as an actual science would be.

    There's some truth here but also some sleight of hand. Yes, AI development often moves outside traditional academic channels. But you imply this automatically means less rigor, which doesn't follow. Many industry labs have internal review processes, replication requirements, and validation procedures that can be as stringent as academic peer review, or more so. The fact that something isn't in Nature doesn't automatically make it less rigorous.

    > The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

    This combines three questionable implications:

    - That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

    - That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

    - That volume ("pumped out") correlates with quality

    You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

    This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

    - Their own employees.

    - Other researchers who will try to replicate results.

    - Partners integrating their technology.

    - Investors doing technical due diligence.

    - Regulators scrutinizing their claims.

    The idea that they would casually risk all that just to bump up one benchmark number (but not too much! just from 10% to 35%) doesn't align with the actual incentive structure these organizations face.

    Both the original comment and this fall into the same trap - mistaking cynicism for sophistication while actually displaying a somewhat superficial understanding of how modern AI research and development actually operates.

    • This reply reads as though it were AI generated.

      Let's bite, though, and hope that unhelpful, excessively long-winded replies are just your quirk.

      > This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

      Ok, provide specifics yourself then. Someone replied and pointed out that they have every incentive to cheat, and your response was:

      > This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

      Respond to the content of the argument -- be specific. WHY is OpenAI not incentivized to cheat on this benchmark? Why is a once-nonprofit that went from releasing open and transparent models to a closed model, and began raking in tens of billions in investor cash, not incentivized to continue making those investors happy? Be specific. Because there's a clear pattern of corporate behaviour at OpenAI and associated entities which suggests your take is not, in fact, the simpler viewpoint.

      > This combines three questionable implications:

      > - That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

      Yes, arXiv will host lots of stuff that isn't real concrete research. They've hosted April Fool's jokes, for example.[1]

      > - That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

      This is a poor/incorrect reading of the language. You have inferred meaning that does not exist. If citations are so important here, cite a few dozen that are peer reviewed out of the hundreds.

      > - That volume ("pumped out") correlates with quality

      Incorrect reading again. Volume here correlates with marketing and hype. It could have an effect on quality but that wasn't the purpose behind the language.

      > You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

      Why is that unjustified? It's no different from people with science backgrounds who have fallen into flat-earther beliefs. They may understand the methods, but if their claims are not tested with rigor and they have abandoned scientific principles, they do not get to keep pretending it's as valid as actual science.

      > This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

      FWIW, this regurgitated talking point is what makes me believe this is an LLM-generated reply. OpenAI is not a major research lab. They appear essentially to be trading on the names of more respected institutions and the mathematicians who came up with FrontierMath. The credibility damage here can be done by a single person sharing data with OpenAI, unbeknownst to individual participants.

      Separately, even under correct conditions, it's not as if there aren't all manner of problems in science when it comes to ethical review. See, for example, [2].

      [1] https://arxiv.org/abs/2003.13879 - FWIW, I'm not against scientists having fun, but it should be understood that arXiv is basically three steps above HN or reddit.

      [2] https://news.ycombinator.com/item?id=26887670

> an unsubstantiated claim about widespread misconduct.

I can't prove it, but I heard it from multiple people in the industry. High contamination levels have been documented for existing benchmarks, though [1,2]. Whether you believe that's just the best decontamination anyone can do, a failure to do the best possible decontamination, or something done on purpose is up to you.
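
For context on what contamination means operationally in [1,2]: roughly, checking whether benchmark items share long word n-grams with the training corpus. A minimal sketch of that kind of check, purely illustrative (the function names, the choice of n, and the "any overlap" rule are mine, not taken from the papers or from any lab's pipeline):

```python
import re

def ngrams(text, n=8):
    """Set of word-level n-grams, lightly normalized."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item, training_doc, n=8):
    """Flag an item if it shares any n-gram with a training document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Hypothetical usage: scan a corpus for benchmark items that appear
# verbatim or near-verbatim, then report the flagged fraction.
corpus = ["... training documents would go here ..."]
benchmark = ["Let p be an odd prime such that ...", "Evaluate the integral of ..."]
flagged = [item for item in benchmark if any(looks_contaminated(item, d) for d in corpus)]
print(f"{len(flagged)}/{len(benchmark)} items flagged")
```

Checks like this catch verbatim leakage but miss paraphrased or reworded duplicates, which is part of why "we decontaminated" claims are hard to verify from the outside.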

> Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.

The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement. Clearly, OpenAI did not plan to use the provided evaluation set as a test set, because then they wouldn't need access to it. Also, reporting validation numbers as a performance metric is not unheard of.

> This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.

How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.

> someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field

My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.

[1] https://aclanthology.org/2024.naacl-long.482/

[2] https://arxiv.org/abs/2412.15194

  • > "I can't prove it, but I heard it from multiple people in the industry"

    The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but they are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of expanding evidence far, far beyond its scope.

    > "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."

    This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

    > "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"

    This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

    > "My credentials are in my profile, not that I think they should matter."

    The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

    When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.

    This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.

    • > Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training")

      Validation is not training, period. I'll ask again: what is the possible goal of accessing the evaluation set if you don't plan to use it for anything except the final evaluation, which is what the test set is used for? Either they just asked for access without any intent to use the provided data in any way except for final evaluation, which can be done without access, or they did somehow utilize the provided data, whether by training on it (which they verbally promised not to), using it as a validation set, using it to create a similar training set, or something else.

      > This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

      OpenAI is not doing science; they are doing business.

      > This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

      The metrics matter to people, but this doesn't mean people can meaningfully predict the model's performance using them. If I were trying to describe each of your arguments as some demagogic technique (you're going to call this ad hominem or something, probably), then I'd say this one is a false dichotomy: it is entirely possible both for metrics to be too imprecise to meaningfully predict performance and for people to care about them at the same time.

      > The attempted simultaneous appeal to and dismissal of credentials

      I'm not appealing to credentials. Based on what I wrote, you made a wrong guess about my credentials, and I pointed out your mistake.

      > at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

      Your position, on the other hand, rests on the assumption that corporations behave ethically and with integrity beyond what is required by the law (and, specifically, their contracts with other entities).
