Comment by cubefox
5 days ago
LLMs simply take in text and return text, so they can be trained via self-supervised learning on large amounts of text. After that, only a little fine-tuning on top is needed and they are ready.
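For concreteness, a minimal sketch of that self-supervised objective; the batch shape, vocabulary size, and random logits below are arbitrary toy stand-ins for a real model and corpus:

```python
import torch
import torch.nn.functional as F

# Self-supervised next-token objective on raw text: the "labels" are just the
# same token stream shifted by one position, so no human annotation is needed.
tokens = torch.randint(0, 32000, (4, 128))   # toy batch of token ids
logits = torch.randn(4, 128, 32000)          # stand-in for a model's output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 32000),       # predictions at positions t
    tokens[:, 1:].reshape(-1),               # targets: the tokens at t+1
)
print(loss.item())
```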
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token-predictor approach wouldn't work. That's why Veo 3 & Co are diffusion models: predicting the next frame directly doesn't work, because it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
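A rough back-of-envelope sketch of the size gap; every number here is an illustrative assumption, not a measurement:

```python
# Rough comparison of raw data volume per second of "content".
# All numbers below are illustrative assumptions, not measurements.

# Text: a fast reader covers ~5 words/sec, ~5 bytes of UTF-8 per word.
text_bytes_per_sec = 5 * 5  # ~25 bytes/sec of raw text

# Video: 24 fps at 1280x720, 3 bytes per pixel, uncompressed.
video_bytes_per_sec = 24 * 1280 * 720 * 3  # ~66 MB/sec of raw pixels

print(f"text:  ~{text_bytes_per_sec} bytes/sec")
print(f"video: ~{video_bytes_per_sec / 1e6:.0f} MB/sec")
print(f"ratio: ~{video_bytes_per_sec / text_bytes_per_sec:.0e}x more raw data for video")
```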
Moreover, even if next-frame prediction did work, it doesn't really do what we want for robotics. When planning its movements, the robot doesn't just need a prediction of the next frame (or an embedding of the next frame), but potentially of the next millions of frames, about things that are much further out in the future.
>The robot doesn't just need a prediction about the next frame
But the residual stream of LLMs doesn't "just" encode the next-token prediction; it is high-level enough to encode predictions a few tokens out, as seen with things like multi-token prediction.
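A minimal sketch of what such a multi-token head could look like; the module name, dimensions, and number of heads are hypothetical, not taken from any particular paper:

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predict tokens at offsets +1..+k from a single hidden state.

    Hypothetical sketch: one shared trunk representation feeds k independent
    linear heads, so the same residual-stream vector has to carry information
    about several future tokens, not just the next one.
    """
    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits: (k, batch, seq, vocab)
        return torch.stack([head(hidden) for head in self.heads])

# Toy usage with random "residual stream" activations.
h = torch.randn(2, 16, 512)                # batch=2, seq=16, d_model=512
logits = MultiTokenHead(512, 32000, k=4)(h)
print(logits.shape)                        # torch.Size([4, 2, 16, 32000])
```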
But yes, I can see that on the input side you probably don't want to take in video frames directly, and that training via teacher forcing is probably inefficient here. So some world-model-tailored embedding like JEPA is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs. LLM, but to me JEPA just seems like an encoder that generates embeddings which can be fed into an LLM. They seem complementary rather than substitutes.
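A sketch of that complementary reading, with all module names and shapes made up for illustration: a JEPA-style encoder compresses frames into a few embeddings, which get projected into the LLM's embedding space and prepended as a prefix, roughly the way vision-language models attach an image encoder:

```python
import torch
import torch.nn as nn

d_vis, d_llm = 256, 512          # hypothetical embedding widths

# Stand-ins for the real components: any frozen JEPA-style encoder and any
# decoder-only LLM that accepts input embeddings would slot in here.
jepa_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 64 * 64, d_vis))
projector = nn.Linear(d_vis, d_llm)        # maps world-model space -> LLM space

frames = torch.randn(8, 3, 64, 64)         # 8 recent frames (toy resolution)
world_tokens = projector(jepa_encoder(frames))       # (8, d_llm)

text_tokens = torch.randn(20, d_llm)       # embedded instruction/text tokens
llm_input = torch.cat([world_tokens, text_tokens])   # prefix + text sequence
print(llm_input.shape)                     # torch.Size([28, 512])
```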
> Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.
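For instance, a toy forward-kinematics model of a 2-link planar arm (the link lengths here are made up) can generate an unlimited corpus of (joint angles, end-effector position) pairs:

```python
import numpy as np

# Forward kinematics of a toy 2-link planar arm (link lengths are made up).
L1, L2 = 0.5, 0.3

def forward_kinematics(theta1: float, theta2: float) -> np.ndarray:
    """End-effector (x, y) position for the given joint angles."""
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return np.array([x, y])

# Generate a synthetic (state, observation) corpus from random joint angles.
rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=(100_000, 2))
positions = np.array([forward_kinematics(t1, t2) for t1, t2 in angles])

print(angles.shape, positions.shape)   # (100000, 2) (100000, 2)
```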
That would be like trying to pretrain GPT-1 on synthetically generated data only. It probably wouldn't work, because the synthetic data doesn't resemble real-world data closely enough.
It did work for AlphaGo Zero (and later AlphaZero), which were entirely trained on synthetic data. But that's for very simple games with strict formal rules, like Go and chess.
A kinematic model of the robot is a physics simulation of the robot. I don't see why that wouldn't resemble real-world data closely enough.