Comment by throw310822
18 days ago
Funny, because riding a bicycle or speaking a language is exactly something people don't have a world model of. Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue. Is "making the right movement at the right time within a narrow boundary of conditions" a world model, or is it just predicting the next move?
> You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition.
I'm not saying that at all. I am saying that any (sufficiently long, varied) coherent speech needs a world model, so if something produces coherent speech, there must be a world model behind it. We can agree that the model is lacking only to the extent that the language output is incoherent, which is very little these days.
> Funny, because riding a bicycle or speaking a language is exactly something people don't have a world model of. Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue
This is circular, because you are assuming their world-model of biking can be expressed in language. It can't!
EDIT: There are plenty of skilled experts, artists, etc. who clearly and obviously have complex world models that let them produce best-in-the-world outputs, but who can't express very precisely how they do this. I would never claim such people have no world model or understanding of what they do. Perhaps we have a semantic / definitional issue here?
> This is circular, because you are assuming their world-model of biking can be expressed in language. It can't!
Ok. So I think I get it. For me, producing coherent discourse about things requires a world model, because you can't just make up coherent relationships between objects and actions long enough if you don't understand what their properties are and how they relate to each other.
You, on the other hand, claim that there are infinite firsthand sensory experiences (maybe we can call them qualia?) that fall in between the cracks of language and are rarely communicated (though we use for that a wealth of metaphors and synesthesia) and can only be understood by those who have experienced them firsthand.
I can agree with that if that's what you mean, but at the same time I'm not sure they constitute such a big part of our thought and communication. For example, we are discussing reality in this thread and yet there are no necessary references to firsthand experiences. Any time we talk about history, physics, space, maths, philosophy, we're basically juggling concepts in our heads with zero direct experience of them.
> You, on the other hand, claim that there are infinite firsthand sensory experiences (maybe we can call them qualia?) that fall in between the cracks of language and are rarely communicated (though we use for that a wealth of metaphors and synesthesia) and can only be understood by those who have experienced them firsthand.
Well, not infinite, but, yes! I am indeed claiming many world models are patterns and associations between qualia, and that only some qualia are essentially representable as or look like linguistic tokens (specifically, the sounds of those tokens being pronounced, or their visual shapes if e.g. math symbols). E.g. I am claiming that the way one learns to cook, or to "do theoretical math", may be more about forming associations between those non-linguistic qualia than, say, obviously, doing philosophy is.
> I'm not sure they constitute such a big part of our thought and communication
The communication part is mostly tautological again, but, yes, it remains very much an open question in cognitive science just how exactly thought works. A lot of mathematicians claim to lean heavily on visualization and/or tactile and kinaesthetic modeling for their intuitions (and most deep math is driven by intuition first), but also a lot of mathematicians can produce similar works and disagree about how they think about it intuitively. And we are seeing some progress from e.g. Aristotle using Lean to generate math proofs in a strictly tokenized / symbolic way, but it remains to be seen if this will ever produce anything truly impressive to mathematicians. So it is really hard to know what actually matters for general human cognition.
I think introspection makes it clear there are a LOT of domains where it is obvious the core knowledge is not mostly linguistic. This is easiest to argue for embodied domains and skills (e.g. anything that requires direct physical interaction with the world), and it is areas like these (e.g. self-driving vehicle AI) where LLMs will be (most likely) least useful in isolation, IMO.
> because you are assuming their world-model of biking can be expressed in language. It can't!
So you can't build an AI model that simulates riding a bike? I'm not talking about an LLM, just the kind of AI simulation we've been building virtual worlds with for decades.
So, now that you agree that we can build AI models of simulations, what are those AI models doing? Are they using a binary language that can be summarized?
Obviously you can build an AI model that rides a bike, just not an LLM that does so. Even the transformer architecture would need significant modification to handle the multiple input sensor streams, and this would be continuous data you don't tokenize, and which might not need self-attention, since sensor data doesn't have long-range dependencies like language does. The biking AI model would almost certainly not resemble an LLM very much.
Calling everything "language" is not some gotcha; the middle "L" in LLM means natural language. Binary code is not "language" in this sense, and these terms matter. Robotics AIs are not LLMs; they are just AI.
> Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue.
This works against your argument. Someone who can ride a bike clearly knows how to ride a bike; that they cannot express it in tokenized form speaks to the limited scope of the written word in representing embodiment.
Yes and no. Riding a bicycle is a skill: your brain is trained to do the right thing and there's some basic feedback loop that keeps you in balance. You could call that a world model if you want, but it's entirely self-contained, limited to a very few basic sensory signals (acceleration and balance), and it's outside your conscious knowledge. Plenty of people lack this particular "world model" and can still talk about cyclists and bicycles and traffic, and whatnot.
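To make "basic feedback loop" concrete, here is a minimal sketch, not a claim about how the brain or a real bicycle works. It assumes the bike's lean behaves like a linearized inverted pendulum and that the corrective action is a simple proportional-derivative rule on lean angle and lean rate. The function name, gains, and all parameters are hypothetical, chosen only for illustration.

```python
def simulate_balance(theta0=0.1, kp=30.0, kd=8.0, dt=0.01, steps=500,
                     g=9.81, length=1.0, controlled=True):
    """Simulate a linearized inverted pendulum and return the final lean angle (rad).

    Assumed toy dynamics: theta'' = (g / length) * theta + u,
    where u is the corrective acceleration from the feedback rule.
    """
    theta, omega = theta0, 0.0  # lean angle and lean rate
    for _ in range(steps):
        # Feedback uses only two signals: current lean and how fast it changes.
        u = -(kp * theta + kd * omega) if controlled else 0.0
        alpha = (g / length) * theta + u  # angular acceleration
        omega += alpha * dt               # integrate lean rate
        theta += omega * dt               # integrate lean angle
    return theta


if __name__ == "__main__":
    print("with feedback:   ", simulate_balance())
    print("without feedback:", simulate_balance(controlled=False))
```

With the feedback term the lean decays toward zero; without it the lean grows without bound. Note that nothing in the loop is expressed in words: it is two numbers in, one number out.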
Ok so I don’t understand your assertion. Just because an LLM can talk about acceleration and balance doesn’t mean it could actually control a bicycle without training with the sensory input, embedded in a world that includes more than just text tokens. Ergo, the text does not adequately represent the world.
Um, yea, I call complete bullshit on this. I don't think anyone here on HN is watching what is happening in robotics right now.
Nvidia is out building LLM driven models that work with robot models that simulate robot actions. World simulation was a huge part of AI before LLMs became a thing. With a tight coupling between LLMs and robot models we've seen an explosion in robot capabilities in the last few years.
You know what robots use to communicate with their actuators and sensors? Oh yes, binary data. We quite commonly call that words. When you have a set of actions that simulate riding a bicycle in virtual space, that can be summarized and described. Who knows if humans can actually read/understand what the model spits out, but that doesn't mean it's invalid.
AI ≠ LLM.
It would be more precise to say that complex world modeling is not done with LLMs, or that LLMs only supplement those world models. Robotics models are AI; calling them LLMs is incorrect (though they may use LLMs internally in places).
The middle "L" in LLM refers to natural language. Calling everything language and words is not some gotcha, and sensor data is nothing like natural language. There are multiple streams / channels, where language is single-stream; sensor data is continuous and must not be tokenized; there are not long-term dependencies within and across streams in the same way that there are in language (tokens thousands of tokens back are often relevant, but sensor data from more than about a second ago is always irrelevant if we are talking about riding a bike), making self-attention expensive and less obviously useful; outputs are multi-channel and must be continuous and realtime, and it isn't even clear the recursive approach of LLMs could work here.
Another good example of world models informed by work in robotics is V-JEPA 2.
https://ai.meta.com/research/vjepa/
https://arxiv.org/abs/2506.09985
Cool, so you agree the language model has to be coupled with a kinematics-sensory model in order to take the instruction "ride the bike" and turn it into action.
My point is merely that the knowledge of how to ride a bike is not necessarily expressible in natural language such that reading a "how to" is sufficient to ride on your first try. It takes falling over a few times to learn how to control your body+bike hybrid.
I don't know how you got this so wrong. In control theory you have to build a dynamical system of your plant (machine, factory, etc). If you have a humanoid robot, you not only need to model the robot itself, which is the easy part actually, you have to model everything the robot is interacting with.
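As a rough illustration of what "building a dynamical system of your plant" means, here is a minimal sketch assuming the simplest possible plant: a point mass pushed by a force, stepped forward with explicit Euler integration. The class and function names are hypothetical, and a real plant model would also have to cover whatever the machine touches, which is the hard part described above.

```python
from dataclasses import dataclass


@dataclass
class PointMassPlant:
    """Toy plant model: a mass on a frictionless line driven by a force."""
    mass: float = 1.0       # kg
    position: float = 0.0   # m
    velocity: float = 0.0   # m/s

    def step(self, force: float, dt: float) -> None:
        """Advance the state by one time step (explicit Euler)."""
        acceleration = force / self.mass
        self.position += self.velocity * dt
        self.velocity += acceleration * dt


def predict(plant: PointMassPlant, forces, dt: float) -> PointMassPlant:
    """Roll the model forward over a planned sequence of force commands."""
    for force in forces:
        plant.step(force, dt)
    return plant


if __name__ == "__main__":
    result = predict(PointMassPlant(), [1.0] * 10, dt=0.1)
    print(result.position, result.velocity)
```

A controller can use a model like this to check where a sequence of commands would take the plant before actually executing it.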
Once you understand that, you realize that the human brain has an internal model of almost everything it is interacting with, and that replicating human-level performance requires the entire human brain, not just isolated parts of it. The reason is that, since we take our brains for granted, we use even the complicated and hard-to-replicate parts of the brain for tasks that seem trivial.
When I take out the trash, organic waste needs to be thrown into the trash bin without the plastic bag. I need to untie the trash bag, pinch it from the other side and then shake it until the bag is empty. You might say big deal, but when you have tea bags or potato peels inside, they get caught on the bag handles and get stuck. You now need to shake the bag in very particular ways to dislodge the waste. Doing this with a humanoid robot is basically impossible, because you would need to model every scrap of waste inside the plastic bag. The much smarter way is to make the situation robot friendly by having the robot carry the organic waste inside a portable plastic bin without handles.