Comment by croddin

6 days ago

There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:

"ARC-AGI-1:
  • Low: 44%, $1.64/task
  • Medium: 57%, $3.18/task
  • High: 59%, $4.16/task

ARC-AGI-2:
  • All reasoning efforts: <5%, $4-7/task

Takeaways:
  • o3-pro in line with o3 performance
  • o3's new price sets the ARC-AGI-1 Frontier"

- https://x.com/arcprize/status/1932535378080395332

I’m not sure the ARC-AGI benchmarks are that interesting: for one, they are image-based, and for two, most people I show them to have issues understanding them, and in fact I had issues understanding them myself.

Given the models don’t even see the versions we get to see, it doesn’t surprise me that they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.

  • "most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? Those benchmarks are so extremely simple they have basically 100% human approval rates. Unless you are saying "I could not grasp it immediately, but later I was able to after understanding the point," I think you and your friends should see a neurologist. And I'm not mocking you, I mean it seriously: those tasks are extremely basic for any human brain, and even some other mammals can do them.

    • lol 100% approval rates? No they don’t.

      Also mammals? What mammals could even understand we were giving it a test?

      Have you seen them, or shown them to average people? I’m sure the people who write them understand them, but if you show these problems to average people on the street, they are completely clueless.

      This is a classic case of some PhD AI guys making a benchmark without really considering what average people are capable of.

      Look, these insanely capable AI systems can’t do these problems, but the boys in the lab can do them. What a good benchmark.

    • You may be above average intelligence. Those challenges are like classic IQ tests, and I bet performance on them has a significant distribution among humans.

  • ARC-AGI is the closest any widely used benchmark comes to an IQ test; it's straight logic/reasoning. Looking at the problem set, it's hard for me to choose a better benchmark for "when this is better than humans, we have AGI."

    • There are humans who cannot do ARC-AGI, though, so how does an LLM not doing it mean that LLMs don’t have general intelligence?

      LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

      But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

      That must mean most humans on this planet aren’t generally intelligent either.
