
Comment by fishpham

7 days ago

Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing' - i.e., improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).

Could it also be that the models are just a lot better than a year ago?

  • > Could it also be that the models are just a lot better than a year ago?

    No, the proof is in the pudding.

    Since AI, we've had higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified against that real benchmark.

    If Gemini 3 DT were better, we would see falling prices for electricity and everything else, at least until they return to pre-2019 levels.

Isn’t the point of ARC that you can’t train against it? Or does it no longer achieve that goal somehow?

  • How can you be sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware, so any test, any benchmark, anything you do leaks by definition. Considering human nature and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

    I say this as someone who really enjoys AI, by the way.

    • > leaks by definition.

      As a measure focused solely on fluid intelligence, learning novel tasks, and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first-order patterns that can be learned from one problem and reused to solve another (a toy sketch of the task format is below).

      The ARC non-profit foundation has private versions of its tests that are never released and that only the foundation can administer. There are also public versions and semi-private sets for labs to run their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified, and that's a pretty big deal.

      IMHO, ARC-AGI is a test that differs from every other AI benchmark in a significant way. It's worth spending a few minutes learning why: https://arcprize.org/arc-agi.
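
      To give a sense of the task format, here's a rough Python sketch - the grids and rules below are invented for illustration, not real ARC-AGI problems:

        # Toy illustration of the ARC-AGI task format (made-up puzzles, not real ones).
        # Each task provides a few input->output grid pairs plus a test input; the
        # solver has to infer that task's rule on the fly.

        def rule_for_task_a(grid):
            # Hypothetical rule for task A: mirror each row left-to-right.
            return [row[::-1] for row in grid]

        def rule_for_task_b(grid):
            # Hypothetical rule for task B: recolour every 1 to a 2.
            return [[2 if cell == 1 else cell for cell in row] for row in grid]

        task_a_demo = ([[1, 0], [0, 1]], [[0, 1], [1, 0]])
        task_b_demo = ([[1, 0], [0, 1]], [[2, 0], [0, 2]])

        # The rule that explains task A's demo pair is useless for task B:
        assert rule_for_task_a(task_a_demo[0]) == task_a_demo[1]
        assert rule_for_task_a(task_b_demo[0]) != task_b_demo[1]
        assert rule_for_task_b(task_b_demo[0]) == task_b_demo[1]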

    • Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

      The pelican benchmark is a good example, because it has proven representative of models' ability to generate SVGs in general, not just pelicans on bikes (a rough sketch of the task is below).
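
      Roughly, the test is: ask for a raw SVG of a pelican riding a bicycle and inspect what comes back. A minimal sketch of that kind of output plus a well-formedness check - the drawing here is a hand-written placeholder, not actual model output:

        # Sketch of the sort of raw SVG a model is asked to produce, plus a basic
        # well-formedness check. The shapes are placeholders, not model output.
        import xml.etree.ElementTree as ET

        svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
          <circle cx="60" cy="90" r="25" fill="none" stroke="black"/>
          <circle cx="140" cy="90" r="25" fill="none" stroke="black"/>
          <ellipse cx="100" cy="55" rx="30" ry="18" fill="white" stroke="black"/>
        </svg>"""

        ET.fromstring(svg)  # raises ParseError if the markup is not well-formed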

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

  • Does folding a protein count? How about increasing performance at Go?

    • "Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence.