Comment by bogtog
7 days ago
I wonder how much of the slow progress on ARC can be explained by the tasks' visual properties making them easy for humans but hard for LLMs.
My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher: turning a 98-character string into a 7x14 grid where each successive letter moved 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.
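For concreteness, here's a minimal Python sketch of one reading of that cipher. The starting cell and the collision rule are my assumptions, not part of the original prompt:

```python
def knight_cipher(msg: str, rows: int = 7, cols: int = 14) -> list[list[str]]:
    """Place successive characters 1 down / 2 right (wrapping), like a knight's move."""
    assert len(msg) == rows * cols, "message must exactly fill the grid"
    grid = [[None] * cols for _ in range(rows)]
    r, c = 0, 0  # assumed: start at the top-left cell
    for ch in msg:
        # Assumed collision rule: the pure (1 down, 2 right) step revisits
        # cells, so scan forward row-major to the next empty cell.
        while grid[r][c] is not None:
            c += 1
            if c == cols:
                c, r = 0, (r + 1) % rows
        grid[r][c] = ch
        r, c = (r + 1) % rows, (c + 2) % cols  # knight-like step: 1 down, 2 right
    return grid

message = ("the quick brown fox jumps over the lazy dog " * 3)[:98]
for row in knight_cipher(message):
    print("".join(row))
```

Even with the rules pinned down this precisely, tracking 98 positions through wrapping coordinates is exactly the kind of bookkeeping a human does visually and an LLM has to do symbolically.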
Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy the tasks are for humans. Humans would presumably be terrible at them too if they had to look at each one character by character.
This feels like a somewhat similar (intuition-lie?) case as the Apple paper showing that reasoning models can't do Tower of Hanoi past 10+ disks. Readers will intuitively think about how they themselves could tediously work through an arbitrarily long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
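To make the scale concrete, here's the classic recursive solution in Python (nothing here is specific to the Apple paper):

```python
def hanoi(n: int, src: str = "A", dst: str = "C", via: str = "B") -> list[str]:
    """Return the complete move list for n disks: 2**n - 1 moves."""
    if n == 0:
        return []
    return (hanoi(n - 1, src, via, dst)     # park n-1 disks on the spare peg
            + [f"{src}->{dst}"]             # move the largest disk
            + hanoi(n - 1, via, dst, src))  # re-stack the n-1 disks on top

print(len(hanoi(10)))  # 1023 moves, every one of which must be emitted in order
```

The procedure itself is trivial; producing all 1023 moves in one pass with zero slips is the hard part, for humans as well as models.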
There are some major hints that this is indeed the case.
I've seen a simple ARC-AGI experiment that took the open set and doubled every image in it: every pixel became a 2x2 block of pixels.
If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance all that much, because the solution doesn't change all that much.
Instead, the performance dropped sharply - which hints that perception is the bottleneck.
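The transform itself is tiny. A sketch, assuming the grids are 2-D integer arrays (np.kron repeats each cell into a 2x2 block):

```python
import numpy as np

def upscale_2x(grid: list[list[int]]) -> list[list[int]]:
    # Kronecker product with a 2x2 block of ones turns every cell
    # into a 2x2 block of the same value.
    return np.kron(np.array(grid), np.ones((2, 2), dtype=int)).tolist()

print(upscale_2x([[1, 2],
                  [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```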
I thought so too back when the test was first released, but now that we have multimodal models which can take images directly as input, shouldn't this point be moot?
The top performer afaik (OpenAI's o3) still treats ARC as a series of characters. I imagine complex reasoning over image tokens isn't nearly as advanced as over text, so treating the grids as characters is still the better option.
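For anyone who hasn't seen it: ARC grids are cells with integer color codes 0-9, so one plausible text serialization (the kind of character stream a text-only solver reasons over) looks like this:

```python
grid = [[0, 0, 7],
        [0, 7, 7],
        [7, 7, 0]]

# One plausible serialization: one character per cell, one line per row.
print("\n".join("".join(str(v) for v in row) for row in grid))
# 007
# 077
# 770
```

A human sees the shape at a glance; the model has to reconstruct the 2-D structure from a 1-D token stream.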
Interesting, I thought one of the whole points of o3 was mixed multimodal reasoning (e.g., everyone doing those GeoGuessr challenges). But maybe that's just a parlor trick and it's not actually implemented that way. I wonder when they're going to extend chain-of-thought to work with image tokens; that seems like it'd help with spatial challenges like this.
Even the very best multimodal LLMs still suffer from a harsh perception bottleneck. They're impressive, but nowhere near as good as the human visual cortex.