Comment by sigmar

11 days ago

Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

  • I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

    edit: they just removed the reference to "3.1" from the pdf

    • I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.

      1 reply →

  • That's odd considering 3.0 is still labeled a "preview" release.

    • I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.

Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just-shy of a solution anyway.

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They never will do on private set, because it would mean its being leaked to google.