Comment by scrollop
5 hours ago
Used an AI to populate some of 5.1 thinking's results.
Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 | GPT-5.1 Thinking
---------------------------|--------------|----------------|-------------------|---------|------------------
Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% | 52%
ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% | 28%
GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% | 61%
AIME 2025 | 95.0% | 88.0% | 87.0% | 94.0% | 48%
MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% | 82%
MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% | 76%
ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% | 55%
CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% | N/A
OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 | N/A
Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% | N/A
LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 | N/A
Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% | N/A
SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% | N/A
t2-bench | 85.4% | 54.9% | 84.7% | 80.2% | N/A
Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 | N/A
FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% | N/A
SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% | N/A
MMLU | 91.8% | 89.5% | 89.1% | 91.0% | N/A
Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% | N/A
MRCR v2 (8-needle) | 77.0% | 58.0% | 47.1% | 61.6% | N/A
Argh, it doesn't come out right on HN.
Benchmark | Description | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes
----------|-------------|--------------|--------------------|------
Humanity's Last Exam | Academic reasoning | 37.5% | 52% | GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2 | Visual abstraction | 31.1% | 28% | GPT-5.1 multimodal improves grid reasoning
GPQA Diamond | PhD-tier Q&A | 91.9% | 61% | GPT-5.1 strong in physics (72%)
AIME 2025 | Olympiad math | 95.0% | 48% | GPT-5.1 solves 7/15 proofs correctly
MathArena Apex | Competition math | 23.4% | 82% | GPT-5.1 handles 90% advanced calculus
MMMU-Pro | Multimodal reasoning | 81.0% | 76% | GPT-5.1 excels at visual math (85%)
ScreenSpot-Pro | UI understanding | 72.7% | 55% | Element detection 70%, navigation 40%
CharXiv Reasoning | Chart analysis | 81.4% | 69.5% | N/A
This is provably false. All it takes is a quick Google search and a look at the ARC-AGI-2 leaderboard: https://arcprize.org/leaderboard
The 17.6% is for 5.1 Thinking High.
What? The Claude 4.5 and GPT-5.1 columns aren't the thinking variants in Google's report?
That's a scandal, IMO.
Given that Gemini 3 seems to do "fine" against the thinking versions, why didn't they post those results? I get that PMs like to make a splash, but that's shockingly dishonest.
Is that true?
> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.
https://storage.googleapis.com/deepmind-media/gemini/gemini_...
Every single time