
Comment by jazzyjackson

18 days ago

> Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue.

This works against your argument. Someone who can ride a bike clearly knows how to ride a bike; that they cannot express it in tokenized form speaks to the limited scope of the written word in representing embodiment.

Yes and no. Riding a bicycle is a skill: your brain is trained to do the right thing, and there's a basic feedback loop that keeps you in balance. You could call that a world model if you want, but it's entirely self-contained, limited to a few basic sensory signals (acceleration and balance), and it sits outside your conscious knowledge. Plenty of people lack this particular "world model" yet can talk about cyclists, bicycles, traffic, and whatnot.

  • Ok, so I don’t understand your assertion. Just because an LLM can talk about acceleration and balance doesn’t mean it could actually control a bicycle without training on the sensory input, embedded in a world that includes more than just text tokens. Ergo, the text does not adequately represent the world.

Um, yeah, I call complete bullshit on this. I don't think anyone here on HN is watching what's happening in robotics right now.

Nvidia is building LLM-driven models that work with robot models simulating robot actions. World simulation was a huge part of AI before LLMs became a thing. With tight coupling between LLMs and robot models, we've seen an explosion in robot capabilities in the last few years.

You know what robots communicate with their actuators and sensors through? Binary data. We quite commonly call that words. When you have a set of actions that simulates riding a bicycle in virtual space, it can be summarized and described. Who knows whether humans can actually read or understand what the model spits out, but that doesn't make it invalid.

  • AI ≠ LLM.

    It would be more precise to say that complex world modeling is not done with LLMs, or that LLMs only supplement those world models. Robotics models are AI, calling them LLMs is incorrect (though they may use them internally in places).

    The middle "L" in LLM refers to natural language. Calling everything language and words is not some gotcha, and sensor data is nothing like natural language. There are multiple streams / channels, where language is single-stream; sensor data is continuous and must not be tokenized; there are not long-term dependencies within and across streams in the same way that there are in language (tokens thousands of tokens back are often relevant, but sensor data from more than about a second ago is always irrelevant if we are talking about riding a bike), making self-attention expensive and less obviously useful; outputs are multi-channel and must be continuous and realtime, and it isn't even clear the recursive approach of LLMs could work here.

    Another good example of world models informed by work in robotics is V-JEPA 2.

    https://ai.meta.com/research/vjepa/

    https://arxiv.org/abs/2506.09985
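The "self-attention gets expensive" point follows directly from its quadratic cost in sequence length. A back-of-envelope sketch, with all numbers hypothetical (a 2k-token text prompt vs. one minute of 1 kHz sensor data, naively treated as one "token" per sample):

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    # Naive self-attention: QK^T plus the attention-weighted sum of values,
    # roughly 2 * n^2 * d multiply-adds for one head-width-d layer.
    return 2 * seq_len * seq_len * d_model

text_tokens = 2_000            # a long-ish text prompt
sensor_samples = 60 * 1_000    # one minute sampled at 1 kHz
d = 512                       # assumed model width

text_cost = attention_flops(text_tokens, d)
sensor_cost = attention_flops(sensor_samples, d)
print(f"text ~{text_cost:.1e} FLOPs, sensors ~{sensor_cost:.1e} FLOPs, "
      f"ratio {sensor_cost // text_cost}x")
```

A sequence 30x longer costs ~900x more per attention layer, which is one reason robotics models downsample, chunk, or sidestep full attention over raw sensor streams rather than feeding samples in like tokens.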

  • Cool, so you agree the language model has to be coupled with a kinematics-sensory model in order to take the instruction "ride the bike" and turn it into action.

    My point is merely that the knowledge of how to ride a bike is not necessarily expressible in natural language such that reading a "how to" is sufficient to ride on your first try. It takes falling over a few times to learn how to control your body+bike hybrid.