Comment by cubefox
6 days ago
I think the fundamental idea behind JEPA (not necessarily this particular Meta implementation) will ultimately be proven correct: predicting embeddings instead of concrete tokens. That's arguably what animals do. Next-token prediction (a probability distribution over the possible next tokens) works well for the discrete domain of text, but it doesn't work well for a continuous domain like video, which is what real-time robotics needs.
For text, a two-byte tokenizer gives you 2^16 (65,536) possible next tokens, and computing a probability distribution over them is very much doable. But the number of "possible next frames" in a video feed is astronomically larger. If one uncompressed frame is 1 megabyte (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number to model directly. So we somehow need to predict only an embedding of the frame: an approximation of how the next frame of the video feed will look.
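To make that size difference concrete, here is the arithmetic from the paragraph above as a few lines of Python (only the numbers already mentioned, nothing model-specific):

```python
# Back-of-the-envelope arithmetic only -- nothing model-specific.
token_bits = 16                    # a 2-byte text token
frame_bits = 8 * 2**20             # one uncompressed 1 MB frame, in bits

print(2**token_bits)               # 65536 -- a softmax over this is routine
print(frame_bits)                  # 8388608 = log2 of the number of possible frames
# 2**frame_bits has about 2.5 million decimal digits, so a distribution over
# all possible next frames can't even be written down, let alone computed --
# hence the appeal of predicting a low-dimensional embedding of the frame instead.
```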
Moreover, for robotics we don't want to predict just the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate, approximately, what will happen to us, with the farther future predicted progressively less precisely. We are relatively sure of what will happen in a second, but less and less sure of what will happen in a minute, a day, or a year.
> We constantly anticipate, approximately, what will happen to us, with the farther future predicted progressively less precisely
There's evidence for this in what's called predictive coding. When that future arrives, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
That would mean that to learn faster, you want to expose yourself to situations where you are often wrong: be surprised often, go down wrong paths, and have a feedback mechanism that tells you when you're wrong. This may also be why the best teachers are the ones who often ask the class questions with counter-intuitive answers.
> There's evidence for this in what's called predictive coding. When that future arrives, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
Yes, and ideally there would be whole backpropagation passes that update the entire model depending on how much the current observation diverges from past predictions. (Though brains use an update mechanism that differs from the backpropagation algorithm.)
Edit: Apparently the theory behind this is, besides "JEPA" and "predictive coding", also broadly known under the names "free energy principle" and "active inference": https://en.wikipedia.org/wiki/Free_energy_principle
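A toy sketch of that kind of prediction-error-driven update, with a single linear map standing in for the whole model (purely illustrative; not how brains, or JEPA, actually implement it):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32
W = np.zeros((dim, dim))      # toy "world model": a single linear predictor
lr = 0.02

def update(prev_emb, observed_emb):
    """One prediction-error-driven step: gradient descent on the squared
    error between the predicted and the actually observed embedding."""
    global W
    predicted = W @ prev_emb
    error = observed_emb - predicted            # the "surprise" signal
    W += lr * np.outer(error, prev_emb)         # bigger surprise -> bigger re-wiring
    return float(np.linalg.norm(error))

# Fake sensory stream: the true dynamics is an unknown rotation of the embedding.
A, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # random orthogonal matrix
x = rng.normal(size=dim)
for t in range(5000):
    x_next = A @ x
    err = update(x, x_next)
    x = x_next
print(f"surprise after training: {err:.4f}")    # typically far below the initial error
```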
I'm only a layman, but at a high level, how does the encoder + predictor of JEPA differ from an LLM?
An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that LLMs currently do it in a "single pass" where they output tokens directly (and CoT is sort of a hack to get reasoning by "looping" in autoregressive output-token space), but IIRC there are some experimental variants that do looped latent reasoning.
Any high-level comparison I can find almost strawmans LLMs: yes, they take in token embeddings directly, but the first few layers of an LLM almost surely convert those into more abstract embeddings, as seen in RepE (representation engineering) research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said, JEPA may be able to do it more efficiently; attention almost surely isn't the _optimal_ architecture for all of this.
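For what it's worth, here is a minimal sketch of how the two training objectives are usually contrasted: next-token cross-entropy vs. regressing the embedding of the hidden/future part. The module names, shapes, and the EMA target encoder are assumptions based on how JEPA-style training is commonly described, not either model's actual code.

```python
import torch
import torch.nn.functional as F

# --- LLM-style: predict a distribution over a discrete vocabulary ------------
vocab, d = 65536, 256
hidden = torch.randn(8, d)                 # final hidden states for 8 positions
unembed = torch.randn(d, vocab)            # output projection
logits = hidden @ unembed                  # (8, vocab)
next_tokens = torch.randint(vocab, (8,))
llm_loss = F.cross_entropy(logits, next_tokens)   # next-token prediction

# --- JEPA-style: predict the *embedding* of the masked/future part -----------
context_encoder = torch.nn.Linear(1024, d)
target_encoder = torch.nn.Linear(1024, d)  # typically an EMA copy, not trained by grad
predictor = torch.nn.Linear(d, d)

context_patch = torch.randn(8, 1024)       # visible part of the input
target_patch = torch.randn(8, 1024)        # hidden/future part of the input
pred = predictor(context_encoder(context_patch))
with torch.no_grad():
    target = target_encoder(target_patch)  # no gradient through the target branch
jepa_loss = F.mse_loss(pred, target)       # regression in embedding space

print(llm_loss.item(), jepa_loss.item())
```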
LLMs simply take in text and return text, so they can be trained via self-supervised learning on large amounts of text. Then they only need a little fine-tuning on top of that, and they are ready.
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token prediction approach wouldn't work. That's why Veo 3 & Co. are diffusion models: predicting the next frame directly doesn't work, because it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
Moreover, even if next-frame prediction did work, it wouldn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or embedding of the next frame) when planning its movements, but potentially about millions of future frames, about things that are much further out in the future.
>The robot doesn't just need a prediction about the next frame
But the residual stream of LLMs doesn't "just" encode the next-token prediction; it is high-level enough to encode predictions a few tokens out, as seen with things like multi-token prediction.
But yes, I can see that in terms of input, you probably don't want to take in video frames directly, and training via teacher forcing is probably inefficient here. So some world-model-tailored embedding like JEPA is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs. LLMs, but to me JEPA just seems like an encoder that generates embeddings which can be fed into an LLM. They seem complementary rather than substitutes.
> Robots take in sensory data and return movements, in real time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.
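For example, even a two-link planar arm already yields unlimited (state, action, next state) tuples. A real pipeline would use a proper simulator, but the principle is the same; the link lengths and action ranges below are made up:

```python
import numpy as np

len1, len2 = 0.4, 0.3        # link lengths in metres (arbitrary)
rng = np.random.default_rng(0)

def forward_kinematics(q):
    """End-effector (x, y) position for joint angles q = (q1, q2)."""
    q1, q2 = q
    x = len1 * np.cos(q1) + len2 * np.cos(q1 + q2)
    y = len1 * np.sin(q1) + len2 * np.sin(q1 + q2)
    return np.array([x, y])

def sample_transition():
    q = rng.uniform(-np.pi, np.pi, size=2)    # random joint configuration
    dq = rng.uniform(-0.1, 0.1, size=2)       # random small joint command
    return forward_kinematics(q), dq, forward_kinematics(q + dq)

dataset = [sample_transition() for _ in range(100_000)]
print(dataset[0])
```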
But how do you go from predicting embeddings (which could be thought of as a type of lossy compression of the original data) back out to something usable, say a sequence of image/video tokens or a sequence of robot actions?
A robot model would need to constantly convert the prediction (an embedding) of future observations, together with a "plan" of what the robot is trying to achieve, into an action: some kind of movement that takes both the action plan and the predicted sensory data into account.
That's very much an unsolved problem, and I don't know how far Meta is along that path. Not very far, I assume.
If I understand your post correctly, they're also doing this:
> V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
> After the actionless pre-training stage, the model can make predictions about how the world might evolve—however, these predictions don’t directly take into account specific actions that an agent would take. In the second stage of training, we focus on making the model more useful for planning by using robot data, which includes visual observations (video) and the control actions that the robot was executing. We incorporate this data into the JEPA training procedure by providing the action information to the predictor. After training on this additional data, the predictor learns to account for specific actions when making predictions and can then be used for control. We don’t need a lot of robot data for this second phase—in our technical report, we show that training with only 62 hours of robot data already results in a model that can be used for planning and control.
> We demonstrate how V-JEPA 2 can be used for zero-shot robot planning in new environments and involving objects not seen during training. Unlike other robot foundation models—which usually require that some training data come from the specific robot instance and environment where the model is deployed—we train the model on the open source DROID dataset and then deploy it directly on robots in our labs. We show that the V-JEPA 2 predictor can be used for foundational tasks like reaching, picking up an object, and placing it in a new location.
> For short-horizon tasks, such as picking or placing an object, we specify a goal in the form of an image. We use the V-JEPA 2 encoder to get embeddings of the current and goal states. Starting from its observed current state, the robot then plans by using the predictor to imagine the consequences of taking a collection of candidate actions and rating the candidates based on how close they get to the desired goal. At each time step, the robot re-plans and executes the top-rated next action toward that goal via model-predictive control. For longer horizon tasks, such as picking up an object and placing it in the right spot, we specify a series of visual subgoals that the robot tries to achieve in sequence, similar to visual imitation learning observed in humans. With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
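That planning loop is essentially model-predictive control by "shooting" in embedding space. A rough sketch of the idea, where `encoder` and `predictor` are stand-ins for the V-JEPA 2 encoder and action-conditioned predictor, and their interfaces here are assumptions rather than taken from the paper:

```python
import numpy as np

def plan_next_action(encoder, predictor, current_frame, goal_image,
                     horizon=5, n_candidates=256, action_dim=7, rng=None):
    rng = rng or np.random.default_rng()
    z = encoder(current_frame)                 # embedding of the current state
    z_goal = encoder(goal_image)               # embedding of the goal image

    best_action, best_cost = None, np.inf
    for _ in range(n_candidates):
        # Sample a candidate action sequence and "imagine" its outcome by
        # rolling the predictor forward in embedding space.
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z_pred = z
        for a in actions:
            z_pred = predictor(z_pred, a)
        cost = np.linalg.norm(z_pred - z_goal)  # distance to the goal embedding
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]

    # Execute only the first action, then re-plan at the next time step (MPC).
    return best_action
```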
This is where the memory bit comes in: if you have a memory of past embeddings and associated label(s), it could be an ANN query to fetch the most similar embeddings and infer from those.
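A brute-force version of that query is just cosine similarity against the stored embeddings; a real system would use an ANN index (e.g. FAISS or HNSW), but the lookup is conceptually the same:

```python
import numpy as np

memory_embeddings = np.random.randn(10_000, 256)   # past embeddings
memory_labels = [f"event_{i}" for i in range(10_000)]

def nearest(query, k=5):
    mem = memory_embeddings / np.linalg.norm(memory_embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = mem @ q                                  # cosine similarity to every entry
    top = np.argsort(-sims)[:k]
    return [(memory_labels[i], float(sims[i])) for i in top]

print(nearest(np.random.randn(256)))
```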
But an embedding is more like a one-way hash, kind of like SHA-1 or MD5, no? You can get from input data to a hash value but not the other way around, right? I know that semantically related inputs end up with nearby embedding vectors, but these clusters could be really sparse in such a high-dimensional space, so the nearest values in a cache may be too far away to be useful?
BTW I'm very much not an expert here and I'm just trying to understand how this system works end to end. Don't take anything I write here as authoritative.
Can you clarify my understanding as a layman please?
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
Yes, that's right.
The JEPA models give me hope that the future isn't just more tokens, more context, and more chain-of-thought.