Comment by EvgeniyZh

6 months ago

> an unsubstantiated claim about widespread misconduct.

I can't prove it, but I heard it from multiple people in the industry. There is, however, evidence of high contamination levels for existing benchmarks [1,2]. Whether you believe that is simply the best that can be done, the result of not doing the most thorough decontamination possible, or something done on purpose is up to you.

> Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.

The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate that agreement. Clearly, OpenAI did not plan to use the provided evaluation set as a test set, because then they wouldn't need access to it. Also, reporting validation numbers as a performance metric is not unheard of.

> This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.

How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.

> someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field

My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.

[1] https://aclanthology.org/2024.naacl-long.482/ [2] https://arxiv.org/abs/2412.15194

> "I can't prove it, but I heard it from multiple people in the industry"

The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but they are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of expanding evidence far, far beyond its scope.

> "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."

This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

> "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"

This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

> "My credentials are in my profile, not that I think they should matter."

The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.

This demonstrates how seemingly technical arguments can violate basic principles of evidence and logic while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass scrutiny in any rigorous research context.

  • > Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training")

    Validation is not training, period. I'll ask again: what could the goal of accessing the evaluation set possibly be if you don't plan to use it for anything except the final evaluation, which is what a test set is for? Either they asked for access without any intent to use the provided data for anything but the final evaluation (which could have been done without access), or they did somehow utilize the provided data: by training on it (which they verbally promised not to do), using it as a validation set, using it to create a similar training set, or something else.

    > This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

    OpenAI is not doing science; they are doing business.

    > This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

    The metrics matter to people, but that doesn't mean people can meaningfully predict a model's performance from them. If I were to label each of your arguments with some demagogic technique (you're probably going to call that ad hominem or something), I'd call this one a false dichotomy: it is entirely possible both that metrics cannot predict performance precisely enough and that people care about them anyway.

    > The attempted simultaneous appeal to and dismissal of credentials

    I'm not appealing to credentials. Based on what I wrote, you made a wrong guess about my credentials, and I pointed out your mistake.

    > at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

    Your position, on the other hand, rests on the assumption that corporations behave ethically and with integrity beyond what is required by the law (and, specifically, by their contracts with other entities).

    • > Validation is not training, period.

      Sure, but what we care about isn't the semantics of the words; it's the effects of what they're doing. Iterated validation plus humans doing hyperparameter tuning will go a long way towards making a model fit the data, even if you never technically run backprop with the validation set as input.
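
      To make the mechanism concrete, here's a minimal sketch (purely synthetic numbers, no connection to any real lab's pipeline): if you repeatedly tune and keep whichever configuration scores best on the same fixed held-out set, that set's score drifts upward even though no gradient ever touches it.

      ```python
      # Selection bias in miniature: choosing the best of many configurations on one
      # fixed held-out set inflates that set's score without any training on it.
      import random

      N_ITEMS = 100        # questions in the fixed held-out benchmark
      N_CONFIGS = 200      # hyperparameter settings tried across many tuning rounds
      TRUE_ACCURACY = 0.5  # every configuration is equally good in reality

      def score(tag: str, n_items: int) -> float:
          """Fraction of questions a simulated config answers correctly."""
          rng = random.Random(tag)  # deterministic per (config, dataset) pair
          return sum(rng.random() < TRUE_ACCURACY for _ in range(n_items)) / n_items

      # "Validation": evaluate every config on the same fixed set and keep the winner.
      val_scores = {cfg: score(f"val-{cfg}", N_ITEMS) for cfg in range(N_CONFIGS)}
      best_cfg = max(val_scores, key=val_scores.get)

      # "Test": evaluate the chosen config once on a fresh set it was never selected on.
      test_score = score(f"test-{best_cfg}", N_ITEMS)

      print(f"best config's held-out score: {val_scores[best_cfg]:.2f}")  # typically ~0.60+
      print(f"same config on a fresh set:   {test_score:.2f}")            # around 0.50
      ```

      Swap the random predictors for real checkpoint selection, prompt tweaks, and decoding settings and the effect only gets stronger, which is why a set used for any kind of selection stops being a clean measure of generalization.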

      > OpenAI is not doing science; they are doing business.

      Are you implying these are orthogonal? OpenAI is a business centered on an ML research lab, which does research, and which people in the research community have generally come to respect.

      > at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

      No, it doesn't. What OP is doing is critiquing OpenAI for their misbehavior. This is one of the few levers we (who do not have ownership or a seat on their board) have to actually influence their future decision-making: well-reasoned critiques can convince people here (including some who decide whether their company uses ChatGPT vs. Gemini vs. Claude vs. ...) that ChatGPT is not as good as the benchmarks might claim, which in effect makes it more expensive for OpenAI to condone this kind of misbehavior going forward.

      The argument that "no companies are moral, so critiquing them is pointless" is just an indirect way of running cover for those same immoral companies.