Comment by solarwindy
6 days ago
What is a visualisation?
Our rod and cone cells could just as well be wired up in any other configuration you care to imagine. And yet, an organisation or mapping that preserves spatial relationships has been strongly preferred over billions of years of evolution, because it lets us make sense of the world most readily. Put another way, spatial feature detectors have emerged as an incredibly versatile substrate for ‘live-action’ generation of world models.
What do we do when we visualise, then? We take abstract relationships (in data, in a conceptual framework, whatever) and map them in a structure-preserving way to an embodiment (ink on paper, pixels on screen) that can wind its way through our perceptual machinery that evolved to detect spatial relationships. That is, we leverage our highly developed capability for pattern matching in the visual domain to detect patterns that are not necessarily visual at all, but which nevertheless have some inherent structure that is readily revealed that way.
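To make that concrete, here is a toy sketch in Python (the data and labels are hypothetical illustrations, not anything from the comment itself): a purely relational, non-visual structure, in this case pairwise co-occurrence counts, is mapped in a structure-preserving way onto 2D positions with classical multidimensional scaling, so that clusters hidden in the raw numbers become obvious as points on a plane.

    # Toy sketch: a non-visual relationship (hypothetical co-occurrence
    # counts) mapped in a structure-preserving way onto 2D positions,
    # where the clusters become visible at a glance.
    import numpy as np
    import matplotlib.pyplot as plt

    items = ["cat", "dog", "purr", "bark", "tensor", "gradient"]
    cooccur = np.array([
        [0, 5, 9, 1, 0, 0],
        [5, 0, 1, 9, 0, 0],
        [9, 1, 0, 1, 0, 0],
        [1, 9, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 8],
        [0, 0, 0, 0, 8, 0],
    ])

    # Turn similarity into distance, then embed in 2D with classical MDS,
    # so items that co-occur often end up spatially close.
    dist = (cooccur.max() - cooccur).astype(float)
    np.fill_diagonal(dist, 0.0)
    n = len(items)
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (dist ** 2) @ J             # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))

    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, items):
        plt.annotate(label, (x, y))
    plt.title("Non-visual relationships, made spatial")
    plt.show()

Nothing about the counts is visual, yet once the items are given positions the two clusters jump out, which is the whole trick.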
What does any of this entail for machine intelligence?
On the one hand, if a problem has an inherent spatial logic to it, then it ought to have good learning gradients in the direction of a spatial organisation of the raw input. So, if specifically training for such a problem, the serialisation probably doesn’t much matter.
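As a rough illustration of what ‘serialisation’ means here (a hypothetical toy, not something the comment specifies): the same small grid can be flattened in different orders before being handed to a sequence model, and the spatial neighbourhoods survive either ordering as fixed strides, so a model trained end-to-end on such a task could in principle pick them up from either.

    # Two serialisations of the same spatially structured input (toy example).
    import numpy as np

    grid = np.arange(16).reshape(4, 4)       # a 4x4 "image" or board state

    row_major = grid.flatten(order="C")      # left-to-right, top-to-bottom
    column_major = grid.flatten(order="F")   # top-to-bottom, left-to-right

    print(row_major)     # [ 0  1  2  3  4  5 ... 15]
    print(column_major)  # [ 0  4  8 12  1  5 ... 15]

    # Horizontal neighbours sit 1 apart in one ordering and 4 apart in the
    # other (and vice versa for vertical neighbours), so the spatial
    # relationships are still encoded, just at different strides.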
On the other hand: expecting a language model to generalise to inherently spatial reasoning? I’m totally with you. Why should we expect good performance?
No clue how the unification might be achieved, but I’d wager that language + action-prediction models will be far more capable than models grounded in language alone. After all, what does ‘cat’ mean to a language model that’s never seen one pounce and purr and so on? (Pictures don’t really count.)