
Comment by chairhairair

5 days ago

OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648

Remember that they've fired all the whistleblowers who would admit to breaking the verbal agreement that they wouldn't train on the test data.

Somewhat related, but lately I’ve been feeling what can best be described as “benchmark fatigue”.

The latest models can score something like 70% on SWE-bench Verified, and yet it’s difficult to say what tangible impact this has on actual software development. Likewise, they absolutely crush humans at competitive programming but are unreliable software engineers on their own.

What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all besides the simple fact that this LLM is very, very good at IMO-style problems?

  • As far as I can tell, the actual advancement here is in the methodology used to create a model tuned for this problem domain, and in how efficient that method is. In theory, that makes it easier to build other problem-domain-specific models.

    That a highly tuned model designed to solve IMO problems can solve IMO problems is impressive, maybe, but yeah it doesn't really signal any specific utility otherwise.

I don't fault you for maintaining a healthy scepticism, but per the President of the IMO: "It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced. What we can say is that correct mathematical proofs, whether produced by the brightest students or AI models, are valid." [1]

The proofs are correct, and it's very unlikely that IMO problems were leaked ahead of time. So the options for cheating in this circumstance are that a) IMO are colluding with a few researchers at OpenAI for some reason, or b) @alexwei_ solved the problems himself - both seem pretty unlikely to me.

[1] https://imo2025.au/wp-content/uploads/2025/07/IMO-2025_Closi...

  • Is b) really that unlikely?

    • Not really. This whole thing looks like a deliberately planned PR campaign, similar to the o3 demo. OpenAI has enough talented mathematicians, and they had enough time to just solve the problems themselves. Alternatively, some participants leaking the questions for a reward isn't very unlikely either, and I definitely wouldn't put it past OpenAI to try something like that. Afterwards, they could secretly give hints or tool access to the model, simply forge the answers, or keep rerunning the model until it produced the correct answer. We know from FrontierMath and ARC-AGI that OpenAI can't be trusted when it comes to benchmarks.

In OpenAI's own released papers, they show Anthropic's models performing better than their own. They tend to be pretty transparent and honest in their benchmarks.

The thing is, only leading AI companies and big tech have the money to fund these big benchmarks and run inference on them. As long as the benchmarks are somewhat publicly available and vetted by reputable scientists and mathematicians, it seems reasonable to believe they're trustworthy.

Not to beat a dead horse or get into a debate, but to hopefully clarify the record:

- OpenAI denied training on FrontierMath, FrontierMath-derived data, or data targeting FrontierMath specifically

- The training data for o3 was frozen before OpenAI even downloaded FrontierMath

- The final o3 model was selected before OpenAI looked at o3's FrontierMath results

Primary source: https://x.com/__nmca__/status/1882563755806281986

You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hint of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs to reputation and morale.

The International Math Olympiad isn’t an AI benchmark.

It’s an annual human competition.