Comment by sigmar

11 days ago

Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

12 comments

gs17 11 days ago

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

sigmar 11 days ago
I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.
edit: they just removed the reference to "3.1" from the pdf
- josalhor 11 days ago
  
  I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.
  
staticman2 11 days ago
That's odd considering 3.0 is still labeled a "preview" release.
- ainch 11 days ago
  
  I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.
WarmWash 11 days ago
The rumor was that 3.1 was today's drop
- losvedir 11 days ago
  
  Where are these rumors floating around?
  

thadk 11 days ago

Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just shy of a solution anyway.

riku_iki 11 days ago

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They will never run it on the private set, because that would mean the set being leaked to Google.