Comment by pixl97

18 days ago

Um, yea, I call complete bullshit on this. I don't think anyone here on HN is watching what is happening in robotics right now.

Nvidia is out building LLM-driven models that work with robot models that simulate robot actions. World simulation was a huge part of AI before LLMs became a thing. With a tight coupling between LLMs and robot models, we've seen an explosion in robot capabilities in the last few years.

You know what robots use to communicate with their actuators and sensors? Oh yes, binary data. We quite commonly call that words. When you have a set of actions that simulate riding a bicycle in virtual space, that can be summarized and described. Who knows if humans can actually read/understand what the model spits out, but that doesn't mean it's invalid.

AI ≠ LLM.

It would be more precise to say that complex world modeling is not done with LLMs, or that LLMs only supplement those world models. Robotics models are AI; calling them LLMs is incorrect (though they may use LLMs internally in places).

The middle "L" in LLM refers to natural language. Calling everything language and words is not some gotcha, and sensor data is nothing like natural language. There are multiple streams / channels, whereas language is single-stream. Sensor data is continuous and must not be tokenized. There are no long-term dependencies within and across streams in the way there are in language (tokens from thousands of tokens back are often relevant, but sensor data from more than about a second ago is always irrelevant if we are talking about riding a bike), which makes self-attention expensive and less obviously useful. Outputs are multi-channel and must be continuous and realtime, and it isn't even clear the recursive approach of LLMs could work here.
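
To make the contrast concrete, here's a rough Python sketch. It is not taken from any robotics stack: the channel names, the 100 Hz sample rate, the one-second window, and the PD gains are all assumptions picked for illustration, and the toy steering rule is nowhere near a real bike controller. It only shows the shape of the data: one stream of discrete token ids on the text side, versus several parallel float channels where only the recent past is kept and the output is a continuous value.

```python
from collections import deque

# Text side: a single stream of discrete token ids (made-up values).
tokens = [1012, 7, 392, 88, 4051]

# Sensor side: several parallel channels of continuous readings per time step.
# Channel names, sample rate, and window length are assumptions.
CHANNELS = ("roll_rad", "roll_rate_rad_s", "steer_rad", "speed_m_s")
SAMPLE_HZ = 100
WINDOW_S = 1.0


class SensorWindow:
    """Keeps only the most recent `window_s` seconds of multi-channel samples."""

    def __init__(self, sample_hz=SAMPLE_HZ, window_s=WINDOW_S):
        # A bounded deque silently drops the oldest sample once it is full.
        self.buf = deque(maxlen=int(sample_hz * window_s))

    def push(self, sample):
        if len(sample) != len(CHANNELS):
            raise ValueError("expected one reading per channel")
        self.buf.append(tuple(float(x) for x in sample))

    def latest(self):
        return self.buf[-1]


def steer_command(window, kp=2.0, kd=0.5):
    """Toy PD rule: steer into the lean. The gains are hypothetical."""
    roll, roll_rate, _steer, _speed = window.latest()
    return kp * roll + kd * roll_rate  # continuous output, not a token id


window = SensorWindow()
for step in range(250):  # 2.5 seconds of fake readings at 100 Hz
    window.push((0.1, 0.2, 0.0, 5.0))

print(len(window.buf))        # only the last second is retained
print(steer_command(window))  # a float steering value
```

Nothing here settles whether a sequence model could do the job; it just makes visible how different the input and output look from a token sequence.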

Another good example of world models informed by work in robotics is V-JEPA 2.

https://ai.meta.com/research/vjepa/

https://arxiv.org/abs/2506.09985

Cool, so you agree the language model has to be coupled with a kinematics-sensory model in order to take the instruction "ride the bike" and turn it into action.

My point is merely that the knowledge of how to ride a bike is not necessarily expressible in natural language such that reading a "how to" is sufficient to ride on your first try. It takes falling over a few times to learn how to control your body+bike hybrid.