Comment by dcre
2 days ago
The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny little models and showed that those don't work. Big surprise! Meanwhile, every other piece of evidence available shows that reasoning models are more reliable on sophisticated problems. Just a few examples:
- https://arcprize.org/leaderboard
- https://aider.chat/docs/leaderboards/
- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Surely the IMO problems weren't "within the bounds" of Gemini's training data.
The Gemini IMO result used a model specifically fine-tuned for math.
Certainly they weren't training on the unreleased problems. Defining "out of distribution" gets tricky.
>The Gemini IMO result used a model specifically fine-tuned for math.
This is false.
https://x.com/YiTayML/status/1947350087941951596
This is false even for the OpenAI model:
https://x.com/polynoamial/status/1946478250974200272
"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."
Every human taking that exam has fine-tuned for math, specifically on IMO problems.