Comment by jwpapi
1 day ago
This is a very good way to estimate AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.
I really wonder why so many people fight against this. We know that AI is useful, we know that AI is useful for research, but we want to know whether it is what we vaguely define as intelligence.
I’ve read the comparisons: airplanes don’t flap their wings, submarines don’t swim. Yes, but that is not the question. I suggest everyone coming up with these comparisons check their biases, because this is about Artificial General Intelligence.
General is the keyword here; that is what ARC is trying to measure. Whether it’s useful or not isn’t the point. Whether AI turns out to be useful after testing isn’t the point either.
This so far has been the best test.
And I also recommend asking AI specialized questions deep in your own job, ones you know the answer to, and seeing how often the solution is wrong. I would guess we are more likely to mistake knowledge for intelligence than to notice missing intelligence. Probably common amongst humans as well.
AGI’s 'general' is the wrong word, I think. Humans aren’t general, we’re jagged: strong in some areas, weak in others, and already surpassed in many domains.
LLMs are way past us at languages, for instance. Calculators passed us at calculating, etc.
We don't call a calculator intelligent.
A calculator is extremely useful, but it is not intelligent.
A computer is extremely useful, but it is not intelligent.
Airplanes don't have wings, but they're damn sure useful, and also not intelligent.
If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.
They are extremely useful. But they are not AGI.
Words matter.
> If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.
I agree, with unresolved questions. Does it count if the LLM writes code which trains a neural network to play the game, and that neural network plays the game better than people do? Does that only count if the LLM tries that solution without a human prompting it to do so?
So your definition of intelligence would be exactly equal to a human, or some subset of them you choose? Could a dog solve ARC-AGI? Probably not, but I would not say dogs lack intelligence. Same with a fruit fly. What if the calculator is powered by actual living neurons? I think you need to know where you actually think the difference between an organic machine and intelligence lies before making blanket statements.
A modern LLM in a loop with a harness for memory and behavior modification in a body would probably fool me.
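For what it's worth, the loop I have in mind is roughly this minimal sketch. To be clear, `call_llm` is a stand-in for whatever model API you'd use (stubbed out here so the sketch runs), and the NOTE/ACTION protocol is just an assumption for illustration:

    # Sketch of an "LLM in a loop with a memory harness".
    # call_llm is a placeholder, not any real product's API;
    # the NOTE:/ACTION: reply format is made up for this example.
    memory = []  # persistent notes the model writes for itself

    def call_llm(prompt: str) -> str:
        # Stand-in for a real model call; returns a canned reply here.
        return "NOTE: the door was locked\nACTION: go left"

    def step(observation: str) -> str:
        prompt = (
            "Notes so far:\n" + "\n".join(memory) + "\n"
            "Observation:\n" + observation + "\n"
            "Reply with lines 'NOTE: <note to remember>' and "
            "'ACTION: <next action>'."
        )
        reply = call_llm(prompt)
        action = ""
        for line in reply.splitlines():
            if line.startswith("NOTE:"):
                memory.append(line[len("NOTE:"):].strip())  # behavior carries over
            elif line.startswith("ACTION:"):
                action = line[len("ACTION:"):].strip()
        return action

Each turn the model reads back its own accumulated notes, which is the crude "memory and behavior modification" part.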
> Airplanes don't have wings
???
Interesting take.
Just to drive that thought further.
What are you suggesting, should we rename it? To me the fundamental question is this: do we still have tasks that humans can do better than AIs?
I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.
I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.
But also I’m not sure how they would compare against an average human with no statistics background. I think it’s really about establishing whether by AGI we mean better than the average human or better than the best human.
I don't have a good alternative, sadly. Human-Equivalent Intelligence? ChatGPT suggests "Systems that increasingly Pareto-dominate human intelligence across domains". Not so catchy.
The "things that currently make money" definition is interesting. Bc they are the things that automation can't currently do, because could be automated, then price would tend to 0 and and couldn't make money at it.
We are jagged, but we can smooth that jaggedness if we choose to do so. LLMs stay jagged.
There's no objective measure for comparing intelligences; we can only say an LLM is jagged compared to humans.
I’d actually focus on something else entirely here.
Let's be honest: we are giving LLMs and humans the exact same tasks, but are we putting them on an equal playing field? Specifically, do they have access to the same resources and behavioral strategies?
- LLMs don't have spatial reasoning.
- LLMs don't have a lifetime of video game experience starting from childhood.
- LLMs don't have working memory or the ability to actually "memorize" key parameters on the fly.
- LLMs don't have an internal "world model" (one that actively adapts to real-world context and the actual process of playing a game).
... I could go on, but I've outlined the core requirements for beating these tests above.
So, are we putting LLMs and humans in the same position? My answer is "no." We give them the same tasks, but their approach to solving them—let alone their available resources—is fundamentally different. Even Einstein wouldn't necessarily pass these tests on the first try. He’d first have to figure out how to use a keyboard, and then frantically start "building up new experience."
P.S. To quickly address the idea that LLMs and calculators are just "useful tools" that will never become AGI—I have some bad news there too. We differ from calculators architecturally; we run on entirely different "processors." But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions. This means our only real advantage over them is our baseline configuration and the list of "tools" connected to our neural network (senses, motor functions, etc.). To me, this means LLMs don't have any fundamental "architectural" roadblocks. We just have a head start, but their speed of evolution is significantly faster.
>But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions.
There are high-level similarities between ANNs and the human brain but they are very, very, very different in a ton of ways.
LLMs haven't passed us in language; a child can learn a language with so much less data than an LLM needs.
Isn't that more about rate of learning? Agreed, LLMs consume a lot of data.
But your average LLM understands more languages than anyone alive. So, superhuman understanding of various text-based languages.
The thing is, this is more akin to testing a blind person's performance on a driving test than testing their intelligence.
I would imagine if you simply encoded the game in textual format and asked an LLM to come up with a series of moves, it would beat humans.
The problem here is more around perception than anything.
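For concreteness, "encoding the game in textual format" could be as simple as this sketch. The toy grid, the cell legend, and the prompt wording are all my own assumptions:

    # Sketch: serialize a grid-based game state into plain text so the
    # model sees the 2D structure as characters instead of pixels.
    grid = [
        [0, 0, 1],
        [0, 2, 1],
        [0, 0, 3],
    ]

    def encode_grid(g):
        # One row per line, cells space-separated.
        return "\n".join(" ".join(str(cell) for cell in row) for row in g)

    prompt = (
        "Board (0=empty, 1=wall, 2=player, 3=exit):\n"
        + encode_grid(grid)
        + "\nGive the shortest sequence of moves (up/down/left/right) "
        "to reach the exit."
    )
    print(prompt)  # feed this to the model of your choice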
I had the same theory back when ARC-AGI-2 came out, and surprisingly encoding it into text didn't help much - LLMs just have a huge blind spot around spatial reasoning, in addition to being bad at vision. The sorts of logic and transformations involved in this just don't show up much in the training data (yet)
I still agree that this is like declaring blind people lack human intelligence, of course.
It only tests puzzle solving; intelligence is cost compression that powers itself.
Previous iterations of ARC-AGI were reminiscent of IQ tests. This one is just too easy, and the fact that models do terribly on it probably means there is an input-mode mismatch or an operation-mode mismatch.
If model creators are willing to teach their LLMs to play computer games through text, it's gonna be solved in one minor bump of the model version. But honestly, I don't think they are gonna bother, because it's just too silly and they won't expect their models to learn anything useful from it.
Especially since there are already models that can learn how to play 8-bit games.
It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.