Comment by demirbey05

5 days ago

Google also entered the IMO and won a gold medal.

https://x.com/natolambert/status/1946569475396120653

OAI announced early; we will probably hear an announcement from Google soon.

Google’s AlphaProof, which won silver last year, has been using a neuro-symbolic approach. This gold from OpenAI came from a pure LLM. We’ll have to see what Google announces, but the LLM approach is interesting because it will likely generalize to all kinds of reasoning problems, not just mathematical proofs.

  • OpenAI’s systems haven’t been pure language models since the o models though, right? Their RL approach may very well still generalize, but it’s not just a big pre-trained model that is one-shotting these problems.

    The key difference is that they claim not to have used any verifiers.

    • What do you mean by “pure language model”? The reasoning step is still just the LLM spitting out tokens, which was confirmed by DeepSeek replicating the o models. According to the OpenAI researchers, there’s no proof verifier or anything similar running alongside it.

      If you mean pure as in no additional training beyond pretraining, I don’t think any model has been pure since GPT-3.5.

  • > it will likely generalize to all kinds of reasoning problems, not just mathematical proofs

    Big if true. Setting up an RL loop for training on math problems seems significantly easier than for many other reasoning domains. It’s much easier to verify the correctness of a proof than to verify the correctness (what would that even mean?) of a short story.
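
    To make the contrast concrete, here’s a minimal Python sketch of the two reward settings. It’s purely illustrative: the function names and the answer-extraction rule are hypothetical, not any lab’s actual training code.

        # Hypothetical sketch: reward signals for RL fine-tuning.
        def math_reward(model_output: str, ground_truth: str) -> float:
            # Mechanical check: compare the model's final token against a
            # known answer. Cheap, objective, automatable at scale.
            tokens = model_output.strip().split()
            return 1.0 if tokens and tokens[-1] == ground_truth else 0.0

        def story_reward(model_output: str) -> float:
            # No ground truth exists for fiction: any "reward" is a human
            # rating or another model's opinion, not a verification.
            raise NotImplementedError("no objective checker for a story")

        assert math_reward("The proof gives the answer 42", "42") == 1.0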

  • I’m much more excited about the formalized approach, as LLMs are susceptible to making things up. With formalization, we can be mathematically certain that a proof is correct. This could plausibly lead to machines surpassing humans in all areas of math. With a “pure English” approach, you still need a human to verify correctness.
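
    To illustrate what “mathematically certain” means here, a minimal Lean 4 example (core library only; the theorem names are my own). If the file compiles, the kernel has verified the proofs; a wrong proof simply fails to type-check:

        -- If this compiles, the Lean kernel has checked the proof.
        theorem two_plus_two : 2 + 2 = 4 := rfl

        -- Reusing the core-library lemma Nat.add_comm; no human
        -- review of the argument is needed.
        theorem add_comm' (a b : Nat) : a + b = b + a := Nat.add_comm a b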

  • Neither Gemini nor OpenAI has open models. We don’t know for sure what’s happening underneath.

Given the Noam Brown comment ("It was a surprise even to many researchers at OpenAI"), it seems extra surprising if multiple labs achieved this result at once.

There's a comment on this Twitter thread saying the Google model was using Lean, while IIUC the OpenAI one was pure LLM reasoning (no tools). Anyone have any corroboration?

In a sense it's kinda irrelevant: I care much more about the concrete things AI can achieve than about the how. But at the same time, it's very informative to see the limits of specific techniques expand.