Comment by energy123
6 days ago
That would require AIME 2024 going above 100%.
There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.
Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.
If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:
"ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task
ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task
Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"
- https://x.com/arcprize/status/1932535378080395332
I’m not sure the arcagi are interesting benchmarks, for one they are image based and for two most people I show them too have issues understanding them, and in fact I had issues understanding them.
Given the models don’t even see the versions we get to see it doesn’t surprise me they have issues we these. It’s not hard to make benchmarks that are so hard that humans and Lims can’t do.
"most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.
10 replies →
arc agi is the closest any widely used benchmark is coming to an IQ test, its straight logic/reasoning. Looking at the problem set its hard for me to choose a better benchmark for "when this is better than humans we have agi"
7 replies →