Comment by scrollop
5 hours ago
Used an AI to populate some of 5.1 thinking's results.
Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 | GPT-5.1 Thinking
---------------------------|--------------|----------------|-------------------|---------|------------------
Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% | 52%
ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% | 28%
GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% | 61%
AIME 2025 | 95.0% | 88.0% | 87.0% | 94.0% | 48%
MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% | 82%
MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% | 76%
ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% | 55%
CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% | N/A
OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 | N/A
Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% | N/A
LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 | N/A
Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% | N/A
SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% | N/A
t2-bench | 85.4% | 54.9% | 84.7% | 80.2% | N/A
Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 | N/A
FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% | N/A
SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% | N/A
MMLU | 91.8% | 89.5% | 89.1% | 91.0% | N/A
Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% | N/A
MRCR v2 (8-needle) | 77.0% | 58.0% | 47.1% | 61.6% | N/A
Argh, it doesn't come out right on HN.
Benchmark | Description | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes
----------|-------------|--------------|--------------------|------
Humanity's Last Exam | Academic reasoning | 37.5% | 52% | GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2 | Visual abstraction | 31.1% | 28% | GPT-5.1 multimodal improves grid reasoning
GPQA Diamond | PhD-tier Q&A | 91.9% | 61% | GPT-5.1 strong in physics (72%)
AIME 2025 | Olympiad math | 95.0% | 48% | GPT-5.1 solves 7/15 proofs correctly
MathArena Apex | Competition math | 23.4% | 82% | GPT-5.1 handles 90% advanced calculus
MMMU-Pro | Multimodal reasoning | 81.0% | 76% | GPT-5.1 excels at visual math (85%)
ScreenSpot-Pro | UI understanding | 72.7% | 55% | Element detection 70%, navigation 40%
CharXiv Reasoning | Chart analysis | 81.4% | 69.5% | N/A
This is provably false. All it takes is a quick Google search and a look at the ARC-AGI-2 leaderboard: https://arcprize.org/leaderboard
The 17.6% is for 5.1 Thinking High.
What? The Claude 4.5 and GPT-5.1 columns aren't the thinking variants in Google's report?
That's a scandal, IMO.
Given that Gemini 3 seems to do "fine" against the thinking versions, why didn't they post those results? I get that PMs like to make a splash, but that's shockingly dishonest.
Is that true?
> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.
https://storage.googleapis.com/deepmind-media/gemini/gemini_...
Every single time