Comment by energy123

9 months ago

That would require AIME 2024 going above 100%.

There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.

Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.

If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.

21 comments

energy123

croddin 9 months ago

There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:

"ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task

ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task

Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"

- https://x.com/arcprize/status/1932535378080395332

saberience 9 months ago
I’m not sure the arcagi are interesting benchmarks, for one they are image based and for two most people I show them too have issues understanding them, and in fact I had issues understanding them.
Given the models don’t even see the versions we get to see it doesn’t surprise me they have issues we these. It’s not hard to make benchmarks that are so hard that humans and Lims can’t do.
- nipah 9 months ago
  
  "most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.
  
  10 replies →
- HDThoreaun 9 months ago
  
  arc agi is the closest any widely used benchmark is coming to an IQ test, its straight logic/reasoning. Looking at the problem set its hard for me to choose a better benchmark for "when this is better than humans we have agi"
  
  7 replies →