Comment by charlieyu1

4 days ago

I tried P1 on chatgpt-o4-high; it tells me the solution is k=0 or 1. It doesn’t even know that k=3 is a solution for n=3. Such a solution would get 0/7 in the actual IMO.
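
For what it's worth, here's one explicit witness that k=3 works for n=3; the construction is mine, not pulled from any model's transcript. The points to cover are the six (a,b) with a,b ≥ 1 and a+b ≤ 4, and a line is "sunny" if it is not horizontal, not vertical, and not of slope -1:

    # Verify k = 3 for n = 3: three sunny lines covering all six grid points.
    from fractions import Fraction

    points = [(a, b) for a in range(1, 4) for b in range(1, 4) if a + b <= 4]

    # Each line is given by two of the six points; the slopes are 1, -1/2, -2,
    # so none is horizontal, vertical, or of slope -1 (all three are sunny).
    lines = [((1, 1), (2, 2)), ((1, 2), (3, 1)), ((1, 3), (2, 1))]

    def on_line(p, a, b):
        # collinearity via the cross product, avoiding division
        return (b[0] - a[0]) * (p[1] - a[1]) == (b[1] - a[1]) * (p[0] - a[0])

    def slope(a, b):
        return Fraction(b[1] - a[1], b[0] - a[0])

    assert all(any(on_line(p, *ln) for ln in lines) for p in points)
    assert all(slope(*ln) not in (0, -1) for ln in lines)
    print("n = 3, k = 3: verified")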

What about o3-pro? Remember that their model names make no sense.

Edit due to rate-limiting:

o3-pro returned an answer after 24 minutes: https://chatgpt.com/share/687bf8bf-c1b0-800b-b316-ca7dd9b009... Whether the CoT amounts to valid mathematical reasoning, I couldn't say, especially because OpenAI models tend to be very cagey with their CoT.

Gemini 2.5 Pro seems to have used more sophisticated reasoning ( https://g.co/gemini/share/c325915b5583 ) but it got a slightly different answer. Its chain of thought was unimpressive to say the least, so I'm not sure how it got its act together for the final explanation.

Claude Opus 4 appears to have missed the main solution the others found: https://claude.ai/share/3ba55811-8347-4637-a5f0-fd8790aa820b

It'd be interesting if someone could try Grok 4.

  • I only have basic o3 to try. It spent like 10 minutes but didn't return any response, due to a network error. Checking the thoughts, the model was doing a lot of brute-forcing up to n=8 and found k=0,1,3, but no mathematical reasoning was visible.
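
    For reference, here's roughly the kind of search it seemed to be running. This is my own sketch, assuming the standard reading of P1: cover every (a,b) with a,b ≥ 1 and a+b ≤ n+1 by exactly n distinct lines, exactly k of them sunny (not horizontal, not vertical, not of slope -1). It backtracks on the first uncovered point; any leftover lines are unconstrained, so each can be chosen sunny or not at will.

        from math import gcd

        def line_through(p, q):
            # Normalized integer triple (A, B, C) for the line Ax + By = C through p, q.
            A, B = q[1] - p[1], p[0] - q[0]
            C = A * p[0] + B * p[1]
            g = gcd(gcd(abs(A), abs(B)), abs(C)) or 1
            A, B, C = A // g, B // g, C // g
            if A < 0 or (A == 0 and B < 0):
                A, B, C = -A, -B, -C
            return (A, B, C)

        def sunny(line):
            A, B, _ = line
            return A != 0 and B != 0 and A != B  # not horizontal/vertical/slope -1

        def feasible_ks(n):
            pts = frozenset((a, b) for a in range(1, n + 1)
                            for b in range(1, n + 1) if a + b <= n + 1)
            results = set()

            def covered(line, points):
                A, B, C = line
                return frozenset(p for p in points if A * p[0] + B * p[1] == C)

            def search(uncovered, used, m, s):
                if m > n:
                    return
                if not uncovered:
                    # The n - m leftover lines are unconstrained, so each may
                    # be sunny or not: every k in [s, s + n - m] is achievable.
                    results.update(range(s, s + n - m + 1))
                    return
                p = min(uncovered)  # branch on the first uncovered point
                cands = {line_through(p, q) for q in pts if q != p}
                cands |= {(0, 1, p[1]), (1, 0, p[0]), (1, 1, p[0] + p[1])}
                for line in cands:
                    if line not in used:
                        search(uncovered - covered(line, uncovered),
                               used | {line}, m + 1, s + sunny(line))
                # Or a generic sunny line through p alone, missing all other points.
                search(uncovered - {p}, used, m + 1, s + 1)

            search(pts, frozenset(), 0, 0)
            return sorted(results)

        for n in range(3, 6):  # naive version; n = 8 would need real pruning
            print(n, feasible_ks(n))  # the known answer is k in {0, 1, 3}

    Branching on the first uncovered point keeps the search complete while cutting down on re-exploring permutations of the same cover.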

    • See how this compares to what you got from o3: https://chatgpt.com/share/687bf8bf-c1b0-800b-b316-ca7dd9b009...

      It convincingly argues that Gemini's answer was wrong, and Gemini agrees ( https://g.co/gemini/share/aa26fb1a4344 ).

      So that's pretty cool, IMO. Pitting these two models against each other in a cage match is an underused hack in my experience.

      Another observation worth making: judging by the GitHub link, OpenAI didn't just paste an image of the question into the prompt, hit the button, and walk away, like I did. They rewrote the prompts carefully to get the best results, and I'm a little surprised people aren't crying foul about that. So I'm pretty impressed with o3-pro's unassisted performance.
