Comment by cornel_io

6 days ago

I'm all for benchmarks that push the field forward, but ARC problems seem to be difficult for reasons that have less to do with intelligence and more to do with having a text system that works reliably with rasterized pixel data presented line by line. Most people would score 0 on it if they were shown the data the way an LLM sees it; these problems only seem easy to us because there are visualizers slapped on top.
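
For readers who haven't seen it: in the standard ARC format each grid is a JSON list of lists of integers 0-9, and the model typically receives something like the serialisation below. This is a rough sketch of my own, not any particular harness's exact prompt:

    # A rough sketch, assuming the standard ARC format: each grid is a
    # JSON list of lists of integers 0-9.
    grid = [
        [0, 0, 3, 0],
        [0, 3, 3, 0],
        [0, 0, 3, 0],
        [0, 0, 0, 0],
    ]

    # Roughly what the model receives: one flat, row-major token stream.
    print(" ".join(str(c) for row in grid for c in row))
    # -> 0 0 3 0 0 3 3 0 0 0 3 0 0 0 0 0

    # What we get once a visualizer is slapped on top: the same numbers,
    # with vertical adjacency restored by line breaks.
    for row in grid:
        print("".join("#" if c else "." for c in row))
    # -> ..#.
    #    .##.
    #    ..#.
    #    ....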

What is a visualisation?

Our rod and cone cells could just as well be wired up in any other configuration you care to imagine. And yet, an organisation or mapping that preserves spatial relationships has been strongly preferred over billions of years of evolution, allowing us to make sense of the world most easily. Put another way, spatial feature detectors have emerged as an incredibly versatile substrate for ‘live-action’ generation of world models.

What do we do when we visualise, then? We take abstract relationships (in data, in a conceptual framework, whatever) and map them in a structure-preserving way to an embodiment (ink on paper, pixels on screen) that can wind its way through our perceptual machinery that evolved to detect spatial relationships. That is, we leverage our highly developed capability for pattern matching in the visual domain to detect patterns that are not necessarily visual at all, but which nevertheless have some inherent structure that is readily revealed that way.
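
A toy example of such a map, with everything below invented for illustration: the data is a sampled signal, nothing visual about it, yet sending time to the x-axis and amplitude to the y-axis hands its periodic structure straight to our spatial pattern detectors:

    # A structure-preserving map from non-visual data to "pixels"
    # (characters): time -> horizontal position, amplitude -> vertical
    # position. The signal and layout are made up for illustration.
    import math

    samples = [math.sin(2 * math.pi * t / 16) for t in range(64)]

    ROWS = 9
    for level in range(ROWS - 1, -1, -1):        # top row = high amplitude
        centre = -1 + 2 * level / (ROWS - 1)     # bucket centre in [-1, 1]
        print("".join("*" if abs(s - centre) < 1 / (ROWS - 1) else " "
                      for s in samples))
    # The periodicity, invisible in the raw list of floats, is now obvious.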

What does any of this entail for machine intelligence?

On the one hand, if a problem has an inherent spatial logic to it, then it ought to have good learning gradients in the direction of a spatial organisation of the raw input. So if you are training specifically for such a problem, the serialisation probably doesn't matter much.
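
A minimal sketch of that point (the grid, the width, and the feature are all invented): row-major serialisation is lossless and trivially invertible once the width is known, so nothing stops a model trained on this family of problems from re-imposing the 2D structure and learning spatial features over it:

    # Undoing a row-major serialisation and computing a spatial feature
    # the flat view obscures. Width 4 and the neighbour count are just
    # stand-ins for whatever a trained model would learn.
    import numpy as np

    flat = np.array([0, 0, 3, 0,  0, 3, 3, 0,  0, 0, 3, 0,  0, 0, 0, 0])
    grid = flat.reshape(4, 4)      # the inverse map: trivial, given the width

    # Count of nonzero 4-neighbours per cell, via shifted copies.
    occ = np.pad(grid != 0, 1)     # zero-pad the occupancy mask
    neighbours = (occ[:-2, 1:-1].astype(int) + occ[2:, 1:-1]
                  + occ[1:-1, :-2] + occ[1:-1, 2:])
    print(neighbours)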

On the other hand: expecting a language model to generalise to inherently spatial reasoning? I’m totally with you. Why should we expect good performance?

No clue how the unification might be achieved, but I’d wager that language + action-prediction models will be far more capable than models grounded in language alone. After all, what does ‘cat’ mean to a language model that’s never seen one pounce and purr and so on? (Pictures don’t really count.)