Comment by do_not_redeem

5 days ago

A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.

https://matharena.ai/imo/

I think you're misunderstanding something. That's not "this exact experiment." Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model on that problem set.

It is totally fair to discount OpenAI's statement until we have far more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent, and (incorrectly) using the Matharena results as your proof.

  • If OpenAI published the model before the competition, then one could verify that it was not tinkered with afterward, assuming there even exists a way for them to prove that a model is the same. Since the weights are not open, the most basic approach is unavailable.

  • Implying results are fraudulent is completely fair when it is fraud.

    The previous time they made claims about solving all of the math right there and then, they were caught owning the company that makes the supposedly independent test, and they could neither admit nor deny training on the closed test set.

    • Just to quickly clarify:

      - OpenAI doesn't own Epoch AI (though they did commission Epoch to make the eval)

      - OpenAI denied training on the test set (and further denied training on FrontierMath-derived data, training on data targeting FrontierMath specifically, or using the eval to pick a model checkpoint; in fact, they only downloaded the FrontierMath data after their o3 training set was frozen, and they didn't look at o3's FrontierMath results until after the final o3 model was already selected; primary source: https://x.com/__nmca__/status/1882563755806281986)

      You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this: the author of the primary source linked above no longer works at OpenAI and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hints of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

      Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.