Comment by rafael859
5 days ago
Interesting that the proofs seem to use a limited vocabulary: https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...
Why waste time say lot word when few word do trick :)
Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
Interesting observation. On one hand, these resemble the notes an actual participant would write while solving the problem. Also, fewer words = less noise, more focus. But also, specifically for LLMs that output one token at a time and have a limited context window, I wonder whether restricting itself to semantically meaningful tokens can create longer stretches of semantically coherent thought?
The original thread mentions “test-time compute scaling”, so they had some architecture generating a lot of candidate ideas to evaluate. Minimizing tokens can be very meaningful from a scalability perspective alone!
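A minimal sketch of what that could look like, assuming a hypothetical best-of-n setup with generate() and score() functions (the actual architecture isn't public):

    # Hypothetical best-of-n sketch: sample many candidate solutions,
    # score each one, keep the best. Terser candidates mean more
    # samples fit into the same total token budget.
    def best_of_n(problem, n, generate, score):
        candidates = [generate(problem) for _ in range(n)]
        return max(candidates, key=score)

With a fixed budget of total tokens, halving the average candidate length doubles n, so terseness buys search breadth directly.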
This is just speculation, but I wouldn't be surprised if there were some symbolic AI 'tricks'/tools (and/or modern AI trained to imitate symbolic AI) under the hood.
He is talking about IMO (math olympiad) while he got gold at IOI (informatics olympiad) :)
> Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
Terence Tao also called it in a recent podcast, predicting that the top LLMs would get gold this year.
In transformers, generating each token takes the same amount of time, regardless of how much meaning it carries. By cutting the filler out of the text, you get a huge speedup.
Except generating more tokens also effectively extends the computational power beyond the depth of the circuit, which is why chain of thought works in the first place. Even sampling only dummy tokens that don't convey anything still provides more computational power.
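Both points show up in a toy autoregressive decode loop (a sketch with a hypothetical model.forward, not any real implementation):

    # Toy decode loop: every new token costs one forward pass through
    # the same fixed-depth network, whether it's a key deduction or a
    # filler "hmm" -- so cutting filler saves wall-clock time linearly,
    # while each extra token buys one more sequential compute step.
    def decode(model, ids, n_new):
        for _ in range(n_new):                # time ~ n_new
            logits = model.forward(ids)       # same cost every step
            ids.append(int(logits.argmax()))  # token's meaning is irrelevant here
        return ids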
I mean, generating more tokens means you use more computing power, and there's some evidence that not all of these filler words go to waste (especially since they are not really words, but vectors that can carry latent meaning), as models tend to become smarter when allowed to generate a lot of hemming and hawing.
It's been shown that this accidental computation actually helps CoT models, but they're not supposed to work like that: they're supposed to generate logical observations and use those observations to work further towards the goal (and they primarily do do that).
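For what it's worth, the experiments behind that claim have roughly this shape (hypothetical answer() interface, illustration only):

    # A/B sketch of the filler-token effect (hypothetical API):
    direct = model.answer(problem)
    padded = model.answer(problem + " ... ... ... ...")  # meaningless filler
    # If `padded` is right more often, the extra forward passes helped,
    # even though the filler tokens carry no semantic content.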
Considering that filler tokens occupy context space and are less useful than meaningful tokens, for a model trying to maximize useful results per unit of compute you'd want a terse context window without any fluff.
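Back-of-envelope version of that trade-off, with made-up numbers:

    # Illustrative only: with a fixed context budget, filler eats
    # directly into the tokens available for actual reasoning.
    context_budget = 8192   # assumed context window, in tokens
    filler_fraction = 0.4   # assumed share of fluff
    useful = context_budget * (1 - filler_fraction)
    print(useful)           # -> 4915.2 tokens left for real content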
Dummy tokens work for humans too. “Shhh I need to think!”
Are you saying "see the world?" or "seaworld"?
whoah, very very interesting / telling.