Comment by modeless

9 hours ago

> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions

So, the problem wasn't translated to Lean first. But did the model use Lean, internet search, a calculator, Python, or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure whether this is exactly the same claim. More clarity on this point would be nice.

I would also love to know the rough order of magnitude of the computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but it isn't useful yet if the price is outrageous. In the absence of disclosure, I'm going to assume the price is, in fact, outrageous.

Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305

Sounds like it did not:

> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit

  • I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model: they gave it the exact problem text issued by the IMO.

    It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.

    • As a side question, do you think using tools like Lean will become a staple of these "deep reasoning" LLM flavors?

      It seems that LLMs excel (relative to other paradigms) at the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc.). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.
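
      To make that concrete, here's a toy Lean 4 sketch of my own (not anything from the article): the kernel only accepts a step that actually type-checks, so a subtly wrong justification is caught immediately.

      ```lean
      -- Toy illustration: Lean's kernel accepts only steps that type-check.
      -- Fine: Nat.add_comm a b proves a + b = b + a.
      example (a b : Nat) : a + b = b + a := Nat.add_comm a b

      -- A "hallucinated" step: Nat.add_comm says nothing about multiplication,
      -- so Lean rejects this with a type error if uncommented:
      -- example (a b : Nat) : a * b = b * a := Nat.add_comm a b
      ```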


  • Yes, that quote is contained in my comment. But I don't think it unambiguously excludes tool use in the internal chain of thought.

    I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.

    • End-to-end in natural language would imply no tool use, I'd imagine. Unless it called another tool that converted it, but that would be a real stretch (smoke and mirrors).

We're told that formal verification tools like Lean weren't used to solve the actual IMO problems, but were they used in training the model to solve the problems?

We know from Google's 2024 IMO work that they have a way to translate natural-language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR (reinforcement learning with verifiable rewards) in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.
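
Concretely, I'm imagining something like the sketch below. Every function name here is a placeholder I made up; nothing is known about DeepMind's actual pipeline.

```python
# Hypothetical sketch of the dense-reward idea. autoformalize() and
# lean_check() are invented placeholders for an NL->Lean translator and a
# Lean kernel call; this is not DeepMind's actual training code.

def autoformalize(step: str) -> str | None:
    """Placeholder: translate one natural-language proof step into Lean,
    returning None if no sensible translation exists."""
    raise NotImplementedError

def lean_check(formal_step: str) -> bool:
    """Placeholder: ask the Lean kernel whether the step type-checks."""
    raise NotImplementedError

def dense_reward(nl_proof_steps: list[str]) -> float:
    """Score a proof step by step instead of only pass/fail at the end."""
    total = 0.0
    for step in nl_proof_steps:
        formal = autoformalize(step)
        if formal is None:
            total -= 0.1   # indecipherable reasoning: small penalty
        elif lean_check(formal):
            total += 1.0   # formally verified step: dense positive signal
        else:
            total -= 1.0   # verifiably wrong step: discourage it
    return total / max(len(nl_proof_steps), 1)
```

The point is just that every step yields its own reward signal, instead of a single sparse pass/fail at the very end.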

Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.

In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.
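
The loop I'm picturing, very loosely (all placeholder names again, purely illustrative of the idea rather than AlphaProof's real implementation):

```python
# Loose sketch of a verifier-guided best-first search: an LLM proposes moves,
# a formal checker filters them, and only verified states are explored.
# All functions are invented placeholders.
import heapq

def propose_moves(state: str) -> list[str]:
    """Placeholder: an LLM suggests candidate next proof steps."""
    raise NotImplementedError

def apply_and_verify(state: str, move: str) -> str | None:
    """Placeholder: formalize the move, run the checker, and return the new
    proof state if the step is valid, else None."""
    raise NotImplementedError

def is_proved(state: str) -> bool:
    """Placeholder: true if this state closes all remaining goals."""
    raise NotImplementedError

def best_first_search(initial: str, score, budget: int = 100_000) -> str | None:
    """Spend compute only on lines of attack the checker hasn't ruled out."""
    frontier = [(score(initial), initial)]
    while frontier and budget > 0:
        _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for move in propose_moves(state):
            budget -= 1
            nxt = apply_and_verify(state, move)
            if nxt is not None:  # checker prunes hallucinated moves
                heapq.heappush(frontier, (score(nxt), nxt))
    return None
```

Replacing apply_and_verify with the LLM's own learned judgment would be the "cut out the middleman" version.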

  • It seems very likely from the description in the link that formal verification tools for mathematical proofs were used in part of the RL training for this model. On the other hand, OpenAI claims "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling." That might suggest they don't specifically use e.g. Lean in their training process, but it's not explicitly stated. All we can really do is speculate unless they publish more detail.

    • The OpenAI proofs are so brutally, inhumanly spartan that I can't imagine how the AI came up with them, except by RLVR against some crudely translated formal language.

I'd also be curious why they didn't use Lean. Is it that using Lean at this point makes the problems too easy to brute-force? Or does Lean at this point just get in the way?

  • Lean is an interactive prover, not an automated prover. Last year a lot of human effort was required to formalise the problems in Lean before the machines could get to work. This year the input and output are natural language, and the whole thing is much faster.

    The advantage of Lean is that the system checks the solutions, so hallucination is impossible. Of course, one still relies on the problems being translated into Lean correctly, and the solutions back into natural language.

    Some people prefer difficult-to-read but formally checked solutions over informal but readable ones. The two approaches are just solving different problems.

    But there is another important reason to want to do this reliably in natural language: you can't use Lean for other domains (with a few limited exceptions). They want to train their RL pipelines for general intelligence and make them reliable for long-horizon problems. If a tool is needed as a crutch, that more or less demonstrates that LLMs will not be enough in any domain, and we'll have to wait for traditional AI to catch up in every domain.

    • Oh, I didn't realize that last year the problem formalization was a human effort; I assumed the provers themselves took the problem and created the formalization. Is this step actually harder to automate than solving the problem once it's formalized?

      Anyway, mainly I was curious whether using an interactive prover like Lean would have provided any advantage, or whether that is no longer really the case. My initial take would be that, yes, it should provide a huge advantage. As in chess and Go, it would let the system search a huge space algorithmically and check which approaches get it closer to a solution, with the AI "only" responsible for deciding which approaches to try.

      OTOH, maybe not. Maybe the search space is so big that trying to go through it linearly is a waste of CPU, in which case the translation to Lean plausibly offers no benefit. And now that I think about it, I can imagine that. When doing problems like these, you kind of have to figure out the overall approach end to end first, then fill in any gaps in your logic, and the formalization/writing step is the last thing you do. So I could see how starting with formalization from the outset could end up being the wrong approach for IMO-level problems. It'd just be nice to have that confirmed.

      The cool thing is that, if true, it implies this is something completely different from the chess/Go engines that rely on sheer computational power. Not so much a "Deep Blue" moment, but more of an existential one.

I wonder if "no tool use, no internet access" means it can run without Google infrastructure, and offline. Meaning it could be deployed locally for people who need that.