
Comment by krackers

6 days ago

I'm only a layman, but at a high level, how does the encoder + predictor of JEPA differ from an LLM?

An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that current LLMs do it in a "single pass" where they output tokens directly (and CoT is sort of a hack to get reasoning by "looping" in autoregressive output-token space), but IIRC there are some experimental variants that do looped latent reasoning.

Any high-level comparison I can find almost strawmans LLMs: yes, they take in token embeddings directly, but the first few layers of an LLM almost surely convert those into more abstract embeddings, as seen in RepE (representation engineering) research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said, JEPA may be able to do it more efficiently; attention almost surely isn't the _optimal_ architecture for doing all this.
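To make that concrete, here's roughly how I picture the two training objectives. This is a toy sketch with placeholder modules and dimensions, not real LLM or JEPA code; the JEPA half just reflects my reading of the papers (context encoder + predictor regressing the target encoder's embeddings):

```python
# Toy contrast between the two objectives (PyTorch; all modules and sizes are
# illustrative placeholders, not real LLM or JEPA architectures).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 1000, 64

# --- LLM-style: predict the next *token*, loss lives in the observed token space ---
embed = nn.Embedding(vocab, d)
backbone = nn.GRU(d, d, batch_first=True)   # stand-in for a transformer stack
head = nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (2, 16))   # (batch, sequence)
h, _ = backbone(embed(tokens[:, :-1]))
logits = head(h)
lm_loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

# --- JEPA-style: predict the *embedding* of the masked/future part,
# --- loss lives in the encoder's abstract space, nothing is decoded back ---
context_encoder = nn.Linear(32, d)          # encodes what the model can see
target_encoder = nn.Linear(32, d)           # in practice an EMA copy of the above
predictor = nn.Linear(d, d)                 # predicts the target's embedding

x_context, x_target = torch.randn(2, 32), torch.randn(2, 32)
pred = predictor(context_encoder(x_context))
with torch.no_grad():                       # no gradient through the target branch
    target = target_encoder(x_target)
jepa_loss = F.smooth_l1_loss(pred, target)
```

In both cases an encoder produces embeddings and a predictor makes predictions from them; the main difference I can see is whether the loss is computed in token space or in embedding space.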

LLMs simply take in text and return text, so they can be trained via self-supervised learning on large amounts of text. After that, they only need a little fine-tuning on top and they're ready.

But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.

Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token predictor approach wouldn't work. That's why Veo 3 & Co. are diffusion models: predicting the next frame directly doesn't work, because it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, essentially continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
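Back-of-the-envelope, the mismatch is enormous (very rough numbers, ignoring compression and tokenizer details):

```python
# Rough comparison of raw data rates: video frames vs. text.
frame_bytes = 1920 * 1080 * 3                 # one raw 1080p RGB frame ≈ 6 MB
video_bytes_per_sec = frame_bytes * 30        # ≈ 190 MB/s of raw pixels at 30 fps

tokens_per_sec = 5                            # fast reading pace, roughly
bytes_per_token = 4                           # ~4 characters per token on average
text_bytes_per_sec = tokens_per_sec * bytes_per_token   # ≈ 20 B/s

print(video_bytes_per_sec / text_bytes_per_sec)  # ~10 million times more raw bytes,
                                                 # most of them irrelevant to the task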

Moreover, even if next-frame prediction did work, it wouldn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or an embedding of the next frame) when planning its movements; it potentially needs predictions about the next millions of frames, about things that are much further out in the future.
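What you actually want for planning is something you can roll forward cheaply for many steps in an abstract space and use to score candidate action sequences. A hypothetical sketch, where every module below is just a random stand-in for a learned world model:

```python
# Hypothetical planner over a learned latent world model; encode, dynamics and
# cost are random placeholders here, only the shape of the loop matters.
import torch

d, action_dim = 32, 7
W_dyn = torch.randn(d, d + action_dim) * 0.1
goal = torch.randn(d)

def encode(obs):                      # placeholder encoder: raw observation -> abstract state
    return torch.tanh(obs[:, :d])

def dynamics(z, a):                   # placeholder latent dynamics: (state, action) -> next state
    return torch.tanh(torch.cat([z, a], dim=-1) @ W_dyn.T)

def cost(z):                          # placeholder cost: distance to a goal embedding
    return ((z - goal) ** 2).sum(dim=-1)

def plan(obs, horizon=500, n_candidates=64):
    """Random-shooting planner: roll the latent model forward `horizon` steps
    for each candidate action sequence, return the first action of the best one."""
    z = encode(obs).expand(n_candidates, -1)
    actions = torch.randn(n_candidates, horizon, action_dim)
    total = torch.zeros(n_candidates)
    for t in range(horizon):
        z = dynamics(z, actions[:, t])        # one step in embedding space, no pixels anywhere
        total += cost(z)
    return actions[total.argmin(), 0]

first_action = plan(torch.randn(1, 100))
```

Doing those rollouts frame by frame in pixel space, thousands of times per decision, is hopeless.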

  • > The robot doesn't just need a prediction about the next frame

    But the residual stream of an LLM doesn't "just" encode the next-token prediction; it is high-level enough to encode predictions a few tokens out, as seen with things like multi-token prediction.

    But yes, I can see that in terms of input you probably don't want to take in video frames directly, and training via teacher forcing is probably inefficient here. So some world-model-tailored embedding like JEPA's is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs. LLM, but to me JEPA just seems like an encoder that generates embeddings which can be fed into an LLM. They seem complementary rather than substitutes.

  • > Robots take in sensory data and return movements, in real time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.

    This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.

    • That would be like trying to pretrain GPT-1 on synthetically generated data only. It probably wouldn't work, because the synthetic data doesn't resemble real-world data closely enough.

      It did work for AlphaGo Zero (and later AlphaZero), which were entirely trained on synthetic data. But that's for very simple games with strict formal rules, like Go and chess.

      3 replies →