Comment by saberience
7 days ago
The ARC prize/benchmark is a terrible judge of whether we got to AGI.
If we assume that humans have "general intelligence", we would expect all humans to be able to ace ARC... but they can't. Try asking an average person (a supermarket worker, a gas station attendant, etc.) to do the ARC puzzles: they will do poorly, especially on the newer ones. Yet AI has to do perfectly to prove it has general intelligence? (Not trying to throw shade here, but the reality is that this test is more like an IQ test than an AGI test.)
ARC is a great example of AI researchers moving the goalposts for what we consider intelligent.
Let's get real: Claude Opus is smarter than 99% of people right now, and I would trust its decision-making over that of 99% of the people I know in most situations, except perhaps emotion-driven ones.
The ARC-AGI benchmark is just a gimmick. Also, since it's a visual test and the current models are text-based, it's actually rigged against the AI models anyway, since their datasets were completely text-based.
Basically, it's a test of some kind, but it doesn't mean quite as much as Chollet thinks it means.
He said in the video that they tested regular people (Uber drivers, etc.) on ARC-AGI-2, and at least 2 people were able to solve each task (an average of 9-10 people saw each task). Also this quote from the paper: "None of the self-reported demographic factors recorded for all participants—including occupation, industry, technical experience, programming proficiency, mathematical background, puzzle-solving aptitude, and various other measured attributes—demonstrated clear, statistically significant relationships with performance outcomes. This finding suggests that ARC-AGI-2 tasks assess general problem-solving capabilities rather than domain-specific knowledge or specialized skills acquired through particular professional or educational experiences."
It is not a judge of whether we got to AGI, and literally no one except straw-manning critics is trying to claim it is. The point is that an AGI should easily be able to pass it, while it can obviously be passed without getting to AGI. It's a necessary but not sufficient criterion. If something can't pass a test as simple as ARC (which no AI currently can), then it's definitely not AGI. Anyone claiming AGI should be able to point their AI at the problem and get an 80+% solve rate. Current attempts on the second ARC are below 10%, with zero-shot attempts even worse. Even the better-performing LLMs on the first ARC couldn't do well without significant pre-training. In short, the G in AGI stands for general.
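For readers who haven't looked at the benchmark itself: a minimal sketch of what an ARC task and its pass criterion look like. The train/test structure and the 0-9 integer grids match the public ARC dataset format; the tiny task and the candidate rule here are invented for illustration.

```python
# Hypothetical two-example ARC-style task (grids of colors 0-9).
# Real tasks are larger and the rule is not given anywhere.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[0, 3], [0, 0]]},
    ],
}

def rotate_180(grid):
    """Candidate rule inferred from the examples: rotate the grid 180 degrees."""
    return [row[::-1] for row in grid[::-1]]

# A candidate rule only counts if it reproduces every training output exactly...
assert all(rotate_180(p["input"]) == p["output"] for p in task["train"])

# ...and scoring is exact match on the held-out test grid: all cells or nothing.
prediction = rotate_180(task["test"][0]["input"])
print(prediction == task["test"][0]["output"])  # True
```

The all-or-nothing exact-match scoring is why partial pattern recognition doesn't help: a solver must infer the full transformation from a handful of examples, which is exactly the few-shot generalization the benchmark is meant to probe.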
So do you agree that a human who CANNOT solve ARC doesn't have general intelligence?
If we think humans have "GI", then I think we have AIs right now with "GI" too. Just like humans, AIs spike in various directions: they are amazing at some things and weak at visual/IQ-test-type problems like ARC.
It's a good question, but only complicated answers are possible. A puppy, a crow, and a raccoon all have intelligence, yet certainly none of them can pass the ARC challenge.
I think the charitable interpretation is that intelligence is made up of many skills, and AIs are superhuman at some of them, like image recognition.
And that, therefore, future efforts need to go toward the areas where AIs are significantly less skilled. Also, since they are good at memorizing things, knowledge questions are the wrong direction; anything most humans can solve but AIs cannot, especially something as generic as pattern matching, should be an important target.
This is what is called "spiky" intelligence: a model might be able to crack PhD physics problems and solve byzantine pattern-matching games at the 90th percentile, but also can't figure out how to look up a company and copy its address onto the "customer" line of an invoice.
Maybe it's a cultural difference, but I feel that the "supermarket workers, gas station attendants" (in an Asian country) that I know of would be quite capable of most ARC tasks.
Out of hundreds of evals, ARC is a very distinct and unique one. Most frontier models are also visual now, so I don't see the harm in having this instead of yet another text eval.