
Comment by jsnell

5 days ago

I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model on that problem set.

It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.

If OpenAI published the models before the competition, one could verify afterwards that they were not tinkered with. That assumes there is a way for them to prove that a model is the same one, at least; since the weights are not open, even the most basic approach (sketched below) is void.
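
For illustration, a minimal sketch of that basic approach, assuming the weights are eventually released (file name and function here are hypothetical): publish a hash of the serialized weights before the competition, then let anyone recompute it once the weights are public.

```python
import hashlib

def weights_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 digest of a serialized weights file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Before the competition: publish weights_digest("model.safetensors").
# After the weights are released: anyone recomputes the digest and checks
# that it matches the pre-committed value, i.e. the model was not swapped out.
```

Of course, this only works if the weights become public at some point, which is exactly what isn't happening here.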

Implying that results are fraudulent is completely fair when it actually is fraud.

The previous time they made claims about solving all of the math right there and then, they were caught owning the company that makes that independent test, and they could neither confirm nor deny training on the closed test set.

  • Just to quickly clarify:

    - OpenAI doesn't own Epoch AI (though they did commission Epoch to make the eval)

    - OpenAI denied training on the test set (and further denied training on FrontierMath-derived data, training on data targeting FrontierMath specifically, or using the eval to pick a model checkpoint; in fact, they only downloaded the FrontierMath data after the o3 training set was frozen, and they didn't look at o3's FrontierMath results until after the final o3 model was already selected. Primary source: https://x.com/__nmca__/status/1882563755806281986)

    You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hints of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

    Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs to reputation and morale.

    • That said, I missed the slight semantic difference between "being funded by" and "owning", though I don't see how that would make a difference in practice.

      Regarding the second point, I don't see how "hav[ing] a verbal agreement that these materials will not be used in model training" would actually discourage someone from doing it, because breaking that kind of verbal agreement wouldn't carry any consequences.

      I was not aware of those other claims on Twitter, but IMO they do not create a sufficient basis for an investor fraud case either, because Twitter is not an official way of communicating with investors, which means they can claim whatever they want there. IANAL though.

      I would really like FrontierMath-level problems to be solvable by OpenAI models, and to be able to validate that myself, yet I don't have much hope it will happen in my lifetime.

    • > One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud.

      Everywhere I've worked offered me a significant amount of money to sign a non-disparagement agreement after I left. I have never met someone who didn't willingly sign these agreements. The companies always make it clear that if you refuse to sign, they will give you a bad recommendation in the future.

    • One piece of evidence I have is that the most powerful reasoning models can answer at most 10% of my questions on PhD-student-level computer science, and are unable to produce correct implementations of basic algorithms even when provided with direct references to their implementations and materials that describe them. Damn, o3 can't draw an SVG arrow. The recent advancement in counting the "r"s in "strawberry" is basically as far as it goes.

      I don't know what exactly is at play here, or how exactly OpenAI's models can produce those "exceptionally good" results on benchmarks while being utterly unable to do even a quarter of that in the private evaluations of pretty much everyone I know. I'd expect them to use some kind of RAG technique that makes the question "what was in the training set at the model checkpoint" irrelevant.

      If you consider that several billion dollars of investment and national security are at stake, a "weird conspiracy" becomes a regular Tuesday.

      Unfortunately I can't see beyond the first message of that primary source.

    • > if that's how you feel there's probably not much I can say to change your mind

      You just cited several corporate statements that are not grounded in any evidence and could well be untrue, so you haven't said that much so far.

      > Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs to reputation and morale.

      The prize is XXB of investment and XXXB of valuation, so there is nothing weird about such a conspiracy.