Comment by causal
1 day ago
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.
I'm guessing you did not pass the human testers JSON blobs to work with, and I suspect they would also score 0% without the eyesight and visual-cortex harness attached to their reasoning ability.
I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)
(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
This counterpoint doesn't address the issue, and I would argue it is partially made in bad faith.
Yes, making it to the test center is significantly harder, but the humans could have solved it from their home PCs instead and performed exactly the same. However, if they had been given the same test as the LLMs, with no input beyond JSON, they would have failed. And although buying robots to take the test is infeasible, giving LLMs a screenshot is easy.
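As a rough illustration of how easy that is, here is a minimal sketch that renders an ARC-style JSON frame to a PNG that a vision-capable model could be shown. The palette, file names, and the assumption that a frame is a 2D array of small integer color indices are all mine, not anything from the actual ARC-AGI-3 API:

    # Hypothetical sketch: turn a JSON grid of color indices into an image.
    # PALETTE, file names, and the frame format are assumptions for illustration.
    import json
    from PIL import Image

    PALETTE = [(0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64),
               (255, 220, 0), (170, 170, 170), (240, 18, 190),
               (255, 133, 27), (127, 219, 255), (135, 12, 37)]

    def grid_to_png(grid, path, scale=20):
        # grid is a list of rows, each row a list of small ints
        h, w = len(grid), len(grid[0])
        img = Image.new("RGB", (w, h))
        for y, row in enumerate(grid):
            for x, cell in enumerate(row):
                img.putpixel((x, y), PALETTE[cell % len(PALETTE)])
        # scale up with nearest-neighbour so cells stay crisp
        img.resize((w * scale, h * scale), Image.NEAREST).save(path)

    with open("frame.json") as f:
        grid_to_png(json.load(f), "frame.png")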
Without visual input for the LLMs in a benchmark that humans are asked to solve visually, you are not comparing apples to apples. The LLMs are given a different and significantly harder task, and in a benchmark so heavily weighted against the top human baseline, the results start to mean something very different. Essentially, if LLMs eventually match human performance on this benchmark, it will mean they actually exceed human performance by some unknown factor, since human performance on the JSON version is never measured.
Personally, this hugely decreased my enthusiasm for the benchmark. If your benchmark is to be a north star for AGI, labs should not be steered towards optimizing superhuman JSON-parsing skills. It is much more interesting to steer them towards visual understanding, which is what will actually lead the models out into the world.
I just realized this also means that the benchmark is in practice unverified by third parties, since not all tasks have been verified to be solvable through the JSON interface. Essentially, there is no guarantee that it is even possible to work out how to complete every task optimally through the JSON interface alone.
I assume you did not develop the puzzles by visualizing the JSON yourselves, so there might be non-obvious information that is lost in translation to JSON. Until humans optimally solve all the puzzles without ever having seen the visual version, there is no guarantee that this is even possible.
I think the only viable solution here is to release a version of the benchmark with a vision-only harness. Otherwise it is impossible to interpret what LLM progress on this benchmark actually means.
Well, yes, and that would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.
The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human-centric world.
The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.
I guess it could be interesting to provide alternative versions that make various representations of the same data available. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.
The issue is that ARC-AGI-3 specifically forbids harnesses that humans get to use.
So what? Are you suggesting that an agent exhibiting genuine AGI would be tripped up by having to ingest JSON rather than RGB pixels? LLMs are largely trained on textual data, so JSON is going to be much closer to whatever "native" is for them.
But by all means, give the agents access to an API that returns pixel data. However, I fully expect that would reduce performance rather than increase it.
My sense is that a sufficiently powerful AI would think something like "ah, this sounds like a video game! Let me code up an interactive GUI, test it for myself, then use it to solve these puzzles..." and essentially self-harness (the way you would if you were reading a geometry problem, by drawing it out on paper).
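That self-harness could be as small as a loop that renders each frame, looks at it, and picks the next action. The sketch below reuses the grid_to_png renderer from the earlier comment and assumes hypothetical env/agent objects with reset, step, done, score, and choose_action; none of these names come from the real ARC-AGI-3 API:

    # Hypothetical self-harness loop, assuming the grid_to_png renderer above
    # and made-up env/agent interfaces (reset, step, done, score, choose_action).
    def play(env, agent):
        frame = env.reset()                            # JSON grid from the environment
        while not env.done:
            grid_to_png(frame, "frame.png")            # render the JSON for the agent
            action = agent.choose_action("frame.png")  # vision call on the rendering
            frame = env.step(action)                   # apply the chosen action
        return env.score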
Yeah, but that's literally above ASI, let alone AGI. The average human scores <1% on this bench; Opus scores 97.1% when given actual vision access, which means AGI was achieved long ago.
> Opus scores 97.1% when given actual vision access
Do you have a source for this? I would be very curious to see how top models do with vision.