Comment by toisanji
3 months ago
From reading that, I'm not quite sure they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068
All animals are able to transform coordinates in real time to navigate their world, and humans have more coordinate representations than any other known animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the LLM explosion, and I still personally believe it is the path forward.
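To make that concrete, here's a toy sketch (my own illustration, not code from the paper) of the kind of real-time transform meant here: re-expressing a landmark's allocentric (world-frame) position in an agent's egocentric (body-frame) coordinates, given the agent's pose.

    import numpy as np

    def allocentric_to_egocentric(landmark_xy, agent_xy, heading_rad):
        # World frame -> body frame: translate by the agent's position,
        # then rotate by minus its heading (x = forward, y = left).
        dx, dy = np.asarray(landmark_xy, float) - np.asarray(agent_xy, float)
        c, s = np.cos(-heading_rad), np.sin(-heading_rad)
        return np.array([c * dx - s * dy, s * dx + c * dy])

    # A landmark 1 m north of the agent: to the left when facing east (+x),
    # dead ahead when facing north.
    print(allocentric_to_egocentric([0, 1], [0, 0], 0.0))        # ~[0, 1]
    print(allocentric_to_egocentric([0, 1], [0, 0], np.pi / 2))  # ~[1, 0]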
> From reading that, I'm not quite sure they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.
Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.
Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks; it's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s, and they're not much worse than the videos of today.
I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
I share your surprise regarding LLMs. Is it fair to say that it's because language, especially formalised, written language, is a self-describing system?
A machine can infer the right (or expected) answer based on data. I'm not sure the same is true for how living things navigate the physical world; the "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".
>There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
https://www.youtube.com/watch?v=udPY5rQVoW0
This has been a thing for a while. It's actually a funny way to demonstrate model-based control by replacing the controller with a human.
That GTA demo isn't about control. The user, not the net, is driving.
That's more like the demos where someone trains on a scene and the neural net can make plausible extensions to the scene as you move the viewpoint. It's more spatial imagination, like the tool in Photoshop that fills in plausible but imaginary backgrounds.
It does handle collisions with the edge of the road. Collisions with other cars don't really work; they mostly disappear. One car splits in half in confusion. The spatial part is making progress, but the temporal part, not so much.
> I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
"AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)
Bio-brains have a model based on physical sensor data first and go from there, that's completely missing from "AI".
In hindsight, it's not surprising, we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.
I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. It's as if we had tried to build airplanes that flap their wings. For one, human intelligence already exists; and when you step back and look at how we handle small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.
I'm afraid that if we managed one hundred percent human-level AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.
Right now that would also cover just the abstract parts. I think the physical "moving the body" parts, in relation to abstract commands, would be the far more interesting piece, but current AI isn't about using physical sensor data at all, never mind combining it with the abstract stuff...
You seem to be suggesting that current frontier models are only trained on text and not "sensor data". Multi-modal models are trained on the entire internet + vast amounts of synthetic data. Images and videos are key inputs. Camera sensors are capable of capturing much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except all other models.
You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...
> Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068
> All animals are able to transform coordinates in real time to navigate their world, and humans have more coordinate representations than any other known animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
It's not enough by a long shot. Placement isn't directly related to vicarious trial and error, path integration, or sequence generation.
There's a whole giant gap between grid cells and intelligence.
>There's a whole giant gap between grid cells and intelligence.
Please check this recent article on the orthogonalized state machine that learning produces in the hippocampus [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.
[1] Learning produces an orthogonalized state machine in the hippocampus:
https://www.nature.com/articles/s41586-024-08548-w
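As a toy illustration of why that works (my sketch of the general principle, not the paper's model): sparse random codes are nearly orthogonal, so many key-to-value associations can be superimposed in a single matrix and recalled with little crosstalk.

    import numpy as np

    rng = np.random.default_rng(1)
    N, K, PAIRS = 1000, 20, 50  # code size, active units per code, stored pairs

    def sparse_code():
        # K-hot unit vector; two such random codes overlap almost nowhere.
        v = np.zeros(N)
        v[rng.choice(N, size=K, replace=False)] = 1.0
        return v / np.sqrt(K)

    keys = [sparse_code() for _ in range(PAIRS)]
    values = [sparse_code() for _ in range(PAIRS)]

    # Heteroassociative memory: superimpose outer products value_i key_i^T.
    W = sum(np.outer(v, k) for v, k in zip(values, keys))

    # Recall: W @ key_i ~ value_i, because key_i . key_j ~ 0 for i != j.
    recalled = W @ keys[0]
    best = max(range(PAIRS), key=lambda i: float(recalled @ values[i]))
    print("pair 0 recalled as", best)  # prints 0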
I kept reading, waiting for a definition of spatial intelligence, but gave up after a few paragraphs. After years of reading VC-funded startup fluff, writing that contains these words tends to put me off now: transform, revolutionize, next frontier, North Star.
She's funded by fascist oligarchs at Sequoia, not hard to connect the dots. Just listen to all the buzzwords and ancient Greek allegories though, totally not a bubble...
Thanks for your article. The references section was interesting.
I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6
and a 2024 Scientific Reports article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x
And a simulation on GitHub from 2018: https://github.com/google-deepmind/grid-cells
People have been looking at spatial awareness in neuroscience for quite a while, at least relative to the timeframe of recent developments in LLMs.
What I personally find amusing is this part:
>3. Interactive: World models can output the next states based on input actions
>Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.
That's literally just an RNN (not a transformer). An RNN takes a previous state and an input and produces a new state. If you add a controller on top, it is called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]
[0] https://arxiv.org/abs/2203.04955
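For concreteness, a minimal sketch of that loop (a toy example under my own assumptions, not TD-MPC itself): an RNN-like transition function standing in for the world model, with a random-shooting planner on top playing the model-predictive controller.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H = 4, 2, 32  # state, action, hidden sizes

    # Stand-in "learned" transition: s' = f(s, a), an RNN-style step.
    W_s = rng.normal(scale=0.3, size=(H, S))
    W_a = rng.normal(scale=0.3, size=(H, A))
    W_o = rng.normal(scale=0.3, size=(S, H))

    def world_model(state, action):
        return W_o @ np.tanh(W_s @ state + W_a @ action)

    def mpc_action(state, goal, horizon=10, candidates=256):
        # Random shooting: roll candidate action sequences through the
        # model; execute only the first action of the cheapest rollout.
        best_a, best_cost = None, np.inf
        for _ in range(candidates):
            seq = rng.uniform(-1, 1, size=(horizon, A))
            s, cost = state, 0.0
            for a in seq:
                s = world_model(s, a)
                cost += float(np.sum((s - goal) ** 2))
            if cost < best_cost:
                best_cost, best_a = cost, seq[0]
        return best_a

    # Closed loop: replan at every step (the defining trait of MPC).
    s, goal = rng.normal(size=S), np.zeros(S)
    for _ in range(20):
        s = world_model(s, mpc_action(s, goal))  # model doubles as the "plant" here
    print("final distance to goal:", float(np.sum((s - goal) ** 2)))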
The question, as always, is: can we get any useful insights from all of that?
Trying to copy biological systems 1:1 rarely works, and copying them doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to the human brain, other than being artificial neural networks.
What functional similarity LLMs do have to the human brain doesn't come from reverse-engineered details of how the brain works; it comes from the training process.
There's nothing similar about LLMs and human brains. They're entirely divergent. Training a machine has nothing remotely to do with biological development.
They perform incredibly similar functions. Thus, "functionally similar".
This is super cool, and I want to read up more on it, as I think you're right insofar as this is the basis for reasoning. However, it does seem more complex than just that. How do we go from coordinate-system transformations to abstract reasoning with symbolic representations?
There is research showing that the grid cells also represent abstract reasoning: https://pmc.ncbi.nlm.nih.gov/articles/PMC5248972/
Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
> if they have anything figured out besides "collect spatial data" like ImageNet
I mean, she launched her whole career with ImageNet, so you can hardly blame her for thinking that way. But on the other hand, there's something bitter-lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr. Fei-Fei Li's startup) looks quite promising for a model that understands scenes, reflections included.
[1] https://www.worldlabs.ai/blog/rtfm
I got the opposite impression when trying their demo... [0] Even in their own examples some of these issues show up, like objects staying a constant size as they move, as if the parallax or depth information were missing. Not to mention that they show it walking on water lol
As for reflections, I don't get that impression either. They seem extremely brittle to movement.
[0] http://0x0.st/K95T.png
No, you just don't understand, don't you see! The ancient Greeks foresaw this centuries ago; we are just on the cusp of a world-changing moment. Can't you feel the buzzwords flow through you? First it's creating 7-second meme videos with too many arms, then it's straight on to curing cancer and solving physics! Let the power of buzzwords calm your fears of a bubble.
To decide whether there is anything to "spatial intelligence" (a term that is oxymoronic at worst and redundant at best), one has to identify the base units of the processes prior to their materialization in the allocortex, and to assign a careful concatenated/parametric categorization of what is unitized, where the processes focus into thresholds, etc. This frontier propaganda and the few arXiv/Nature papers here are too synthetic to lead anywhere of merit.