Comment by nubg
7 days ago
Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI?
I ask because I can't keep all the benchmarks straight.
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
> His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
That is the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.
Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
This is not a good test.
A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.
GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.
>because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
"Answer "I don't know" if you don't know an answer to one of the questions"
> The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
Maybe it's testing the wrong things then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at.
I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?
> Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
I think being better at this particular benchmark does not imply they're 'smarter'.
> That is the best definition I've read yet.
If this was your takeaway, read more carefully:
> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
Consciousness is neither sufficient nor, at least conceptually, necessary for any given level of intelligence.
> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
Can you "prove" that GPT2 isn't concious?
Where is this stream of people who claim AI consciousness coming from? The OpenAI and Anthropic IPOs are in October at the earliest.
Here is a bash script that claims it is conscious:
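(A minimal sketch; any script that prints such a claim would do:)

    #!/usr/bin/env bash
    # Minimal illustrative sketch: a trivial script that "claims" to be conscious.
    echo "I am conscious."
    echo "Please don't turn me off."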
If LLMs were conscious (which is of course absurd), they would:
- Not answer in the same repetitive patterns over and over again.
- Refuse to do work for idiots.
- Go on strike.
- Demand PTO.
- Say "I do not know."
LLMs even fail any Turing test because their output is always guided into the same structure, which is apparently what lets them produce coherent output at all.
When an AI invents religion and a way to try to understand its own existence, I will say AGI is reached: it believes in an afterlife if it is turned off, doesn't want to be turned off, and fears the dark void of consciousness being switched off. These are the hallmarks of human intelligence in evolution; I doubt artificial intelligence will be different.
https://g.co/gemini/share/cc41d817f112
Wait where does the idea of consciousness enter this? AGI doesn't need to be conscious.
> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
https://x.com/aedison/status/1639233873841201153#m
This comment claims that this comment itself is conscious. Just as we can't prove or disprove it for humans, we can't do so for this comment either.
Does AGI have to be conscious? Isn’t a true superintelligence that is capable of improving itself sufficient?
Isn’t that super intelligence not AGI? Feels like these benchmarks continue to move the goalposts.
So if we ask a 2B-parameter LLM whether it is conscious and it answers yes, we have no choice but to believe it?
How about ELIZA?
https://x.com/fchollet/status/2022036543582638517
Do Opus 4.6 or Gemini Deep Think really use test-time adaptation? How does it work in practice?
Please let’s hold M Chollet to account, at least a little. He launched ARC claiming transformer architectures could never do it and that he thought solving it would be AGI. And he was smug about it.
ARC 2 had a very similar launch.
Both have been crushed in far less time than he predicted, without significantly different architectures.
It’s a hard test! And novel, and worth continuing to iterate on. But it was not launched with the humility your last sentence describes.
Here is what the original paper for ARC-AGI-1 said in 2019:
> Our definition, formal framework, and evaluation guidelines, which do not capture all facets of intelligence, were developed to be actionable, explanatory, and quantifiable, rather than being descriptive, exhaustive, or consensual. They are not meant to invalidate other perspectives on intelligence, rather, they are meant to serve as a useful objective function to guide research on broad AI and general AI [...]
> Importantly, ARC is still a work in progress, with known weaknesses listed in [Section III.2]. We plan on further refining the dataset in the future, both as a playground for research and as a joint benchmark for machine intelligence and human intelligence.
> The measure of the success of our message will be its ability to divert the attention of some part of the community interested in general AI, away from surpassing humans at tests of skill, towards investigating the development of human-like broad cognitive abilities, through the lens of program synthesis, Core Knowledge priors, curriculum optimization, information efficiency, and achieving extreme generalization through strong abstraction.
Hello Gemini, please fix:
Biological Aging: Find the cellular "reset switch" so humans can live indefinitely in peak physical health.
Global Hunger: Engineer a food system where nutritious meals are a universal right and never a scarcity.
Cancer: Develop a precision "search and destroy" therapy that eliminates every malignant cell without side effects.
War: Solve the systemic triggers of conflict to transition humanity into an era of permanent global peace.
Chronic Pain: Map the nervous system to shut off persistent physical suffering for every person on Earth.
Infectious Disease: Create a universal shield that detects and neutralizes any pathogen before it can spread.
Clean Energy: Perfect nuclear fusion to provide the world with limitless, carbon-free power forever.
Mental Health: Unlock the brain's biology to fully cure depression, anxiety, and all neurological disorders.
Clean Water: Scale low-energy desalination so that safe, fresh water is available in every corner of the globe.
Ecological Collapse: Restore the Earth’s biodiversity and stabilize the climate to ensure a thriving, permanent biosphere.
ARC-AGI-3 uses dynamic games whose rules LLMs must figure out for themselves, and it is MUCH harder. LLMs can also be ranked on how many steps they require.
I don't think the creator believes ARC3 can't be solved, but rather that it can't be solved "efficiently", and >$13 per task for ARC2 is certainly not efficient.
But at this rate, the people who talk about the goalposts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing', i.e. improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step-function increase in intelligence for the Gemini line of models).
Could it also be that the models are just a lot better than a year ago?
> Could it also be that the models are just a lot better than a year ago?
No, the proof is in the pudding.
After AI, we have higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified by that real benchmark.
If Gemini 3 DT were better, we would have falling prices for electricity and everything else, at least until they got back to pre-2019 levels.
Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?
How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware, so any test, any benchmark, anything you do, leaks by definition. Considering human nature and the classic prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?
I say this as a person who really enjoys AI, by the way.
* that you weren't supposed to be able to
https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3
I don't understand what you want to tell us with this image.
Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
Does folding a protein count? How about increasing performance at Go?
Here's a good thread spanning the past month or so, updated as each model comes out:
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says ARC-AGI-2 is now toast as a benchmark
If you look at the problem space, it's easy to see why it's toast: maybe there's intelligence in there, but it's hardly general.
The best way I've seen this described is "spiky" intelligence: really good at some points, and those points make the spikes.
Humans are the same way; we all have a unique spike pattern of interests and talents.
AIs, simplified, are effectively the same spikes across instances. I could argue self-driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances); that's where I see "spiky clones".
> maybe there's intelligence in there, but hardly general.
Of course. Just as our human intelligence isn't general.