Interestingly, a small company called Ogma already did something very similar back in 2021 (on an embedded system, no less). This (https://ogma.ai/2021/07/unsupervised-behavioral-learning-ubl...) is a description/video of how they got a small RC car to predict the next frame of its video feed given the action it was about to take, and thereby made the car navigate to a given location when fed with a still frame of that location (all this with online learning, and no backprop).
Instead of VICReg, they induced their latent state with sparse auto-encoding. Also they predicted in pixel, as opposed to latent, space. The white paper describing their tech is a little bit of a mess, but schematically, at least, the hierarchical architecture they describe bears a strong resemblance to the hierarchical JEPA models LeCun outlined in his big paper from a few years ago. A notable difference, though, is that their thing is essentially a reflex agent, as opposed to possessing a planning/optimization loop.
Just wanted to say thank you very much for sharing this.
Over the last few months I've been inventing this almost exact approach in my head as a hobby without consciously knowing it had already been done. I love their little RC car demo.
The ideas at Ogma are inspired by Numenta's work.
> With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a minimum 20% failure rate sounds pretty bad? The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to 80% does seem like a substantial improvement. If I'm reading the paper correctly, the time required to compute and execute each action also went down from 4 minutes to 16 seconds, which seems significant too.
Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:
> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.
I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.
Currently, you train a VLA (vision-language-action) model for a specific pair of robotic arms, for a specific task. The end-effector actions are embedded in the model itself. So let's say you train a pair of arms to pick up an apple. You cannot zero-shot it to pick up a glass. What you see in demos is the result of lots of training and fine-tuning (few-shot) on specific object types and with specific robotic arms or bodies.
The intermediary language embedding brings some generalisation to the table, but not much. The vision -> language -> action translation is, how do I put this, brittle at best.
What these guys are showing is a zero-shot approach to new tasks in new environments with 80% accuracy. This is a big deal. Pi0 from Physical Intelligence is the best model to compare against, I think.
It’s important to keep some perspective: there are zero robots in the wild, at the moment, that use a world model to work on tasks they weren’t specifically trained on. This is cutting edge research and an 80% success rate is astonishing!
80% success rate is also potentially commercially viable if the task is currently being done by a human.
Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.
I'm surprised that's not how it's already done. I'd have figured some of the inner layers in LLMs were already "world models", and that it's the outer layers that differentiate models between text vs. images/robotics/other modes...
I can buy this, given a very wide meaning of "specifically trained on" and handwaving a bit about "as far as I know*", but then I read the actual wording of "new objects in new and unseen environments", and remember these were floating around Mountain View doing tasks involving new objects in novel environments years ago. Then I kinda gotta give up and admit to myself I'm distorting the conversation by emphasizing positivity over ground truth.
They don’t use it because it’s unsafe and potentially life threatening lol
I run thousands of robots in production. We can get a very high success rate, but only for the task they're designed for. Production robots can't pick up stuff they drop yet. And this '80%' level is not actually acceptable, or even state of the art, for plain pick-and-place, but it's compelling for a robot that also knows how to do other things with equal quality (if JEPA does that).
Yeah, I also wonder how old-school approaches using machine vision, IK, and hard-coded algorithms would compare, or perhaps some hybrid method?
Your comment is not aligned with how science is done. For discoveries, you necessarily work with limited approaches and certainly don't know whether there is a "clear trajectory".
I think the fundamental idea behind JEPA (not necessarily this concrete Meta implementation) will ultimately be correct: predicting embeddings instead of concrete tokens. That's arguably what animals do. Next-token prediction (a probability distribution over the possible next tokens) works well for the discrete domain of text, but it doesn't work well for a continuous domain like video, which would be needed for real-time robotics.
For text, with a two-byte tokenizer you get 2^16 (~65,000) possible next tokens, and computing a probability distribution over them is very much doable. But the "possible next frames" in a video feed would already be an extremely large number: if one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number. So we somehow need to predict only an embedding of the frame, an approximation of how the next frame of the video feed will look.
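To put rough numbers on that (a back-of-the-envelope sketch in Python; the 1 MB per frame figure is just a placeholder assumption):

    import math

    vocab = 2 ** 16              # two-byte tokenizer: ~65,000 possible next tokens
    frame_bits = 8 * 2 ** 20     # one 1 MB uncompressed frame -> 2^(8*2^20) possible frames

    print(vocab)                           # 65536 -- a softmax over this is routine
    print(math.log10(2) * frame_bits)      # ~2.5 million decimal digits in the frame count
    # A distribution over 2^8388608 outcomes is not representable, which is the
    # argument for predicting a low-dimensional embedding of the frame instead.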
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate what happens to us in "the future", approximately, with the farther future predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
> We constantly anticipate what happens to us in "the future", approximately, with the farther future predicted progressively less exactly
There's evidence of this in what's called predictive coding. When that future happens, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
That would mean that to learn faster, you want to expose yourself to situations where you are often wrong: be surprised often and go down wrong paths. Have a feedback mechanism that tells you when you're wrong. This may also be why the best teachers are the ones who often ask the class questions with counter-intuitive answers.
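As a toy, non-biological illustration of error-driven updating (just the classic delta rule with made-up numbers, not a model of any actual brain circuit):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=4)       # a tiny linear "predictor"
    lr = 0.1

    for _ in range(1000):
        x = rng.normal(size=4)                        # current sensory input
        y_observed = x.sum() + 0.01 * rng.normal()    # what actually happens next
        y_pred = w @ x                                # prediction made in advance
        surprise = y_observed - y_pred                # prediction error
        w += lr * surprise * x                        # bigger surprise -> bigger "re-wiring"

    print(w)    # converges towards [1, 1, 1, 1]; most learning happened while surprised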
> There's evidence of this in what's called predictive coding. When that future happens, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
Yes, and ideally there would be whole backpropagation passes which update the entire model depending on how much the current observation diverges from past predictions. (Though brains use an updating mechanism that differs from the backpropagation algorithm.)
Edit: Apparently this theory is also broadly known (apart from "JEPA" and "predictive coding") under the names "free energy principle" and "active inference": https://en.wikipedia.org/wiki/Free_energy_principle
I'm only a layman but at a high level how does the encoder + predictor of JEPA differ from an LLM?
An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that currently LLMs do it in a "single pass" where they output tokens directly (and CoT is sort of a hack to get reasoning by "looping" in autoregressive output-token space), but IIRC there are some experimental variants that do looped latent reasoning.
Any high-level comparison I can find almost strawmans LLMs: yes, they take in token embeddings directly, but the first few layers of an LLM almost surely convert those to more abstract embeddings, as seen in RepE (representation engineering) research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said, JEPA may be able to do it more efficiently; attention almost surely isn't the _optimal_ architecture for doing all this.
LLMs simply take in text and return text, so they can be trained via self-supervised learning on large amounts of text. Then they only need a little fine-tuning on top of that, and they are ready.
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token-predictor approach wouldn't work. That's why Veo 3 & Co are diffusion models: predicting the next frame directly doesn't work, it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
Moreover, even if next-frame prediction did work, it doesn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or an embedding of the next frame) when planning its movements, but potentially about the next millions of frames, about things that are much further out in the future.
But how do you go from predicting embeddings (which could be thought of as a type of lossy compression of the original data) back out to something usable, say a sequence of image/video tokens or a sequence of robot actions?
A robot model would need to constantly convert the prediction (an embedding) of future observations, together with a "plan" of what the robot is trying to achieve, into an action: some kind of movement that takes both the action plan and the predicted sensory data into account.
That's very much an unsolved problem, and I don't know how far Meta is along that path. Not very far, I assume.
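For what it's worth, one common way to frame that "prediction + goal -> action" step is model-predictive control in latent space: sample candidate action sequences, roll each through the learned predictor, and keep the ones whose predicted embedding lands near the goal embedding. As far as I can tell, the paper's goal-image planning works in roughly this spirit. A toy sketch with stand-in dynamics and made-up dimensions, not anyone's actual code:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 2     # toy: latent dim == action dim, purely for illustration

    def predict_latent(z, a):
        # stand-in for the learned predictor: z_{t+1} = f(z_t, a_t)
        return np.tanh(z + 0.5 * a)

    def plan(z_now, z_goal, horizon=5, samples=256, iters=4):
        """Cross-entropy-method style search for an action sequence whose
        predicted final embedding is close to the goal embedding."""
        mu, sigma = np.zeros((horizon, DIM)), np.ones((horizon, DIM))
        for _ in range(iters):
            cand = rng.normal(mu, sigma, size=(samples, horizon, DIM))
            costs = []
            for seq in cand:
                z = z_now
                for a in seq:
                    z = predict_latent(z, a)
                costs.append(np.linalg.norm(z - z_goal))   # distance to goal in latent space
            elite = cand[np.argsort(costs)[: samples // 10]]
            mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        return mu[0]    # execute only the first action, then re-plan (receding horizon)

    print(plan(z_now=np.zeros(DIM), z_goal=0.8 * np.ones(DIM)))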
This is where the memory bit comes in: if you have a memory of past embeddings and associated label(s), it could be an ANN query to fetch the most similar embeddings and infer from those.
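Reading ANN here as approximate nearest neighbour, a toy version of that lookup (brute-force and exact for brevity; a real system would use something like FAISS or HNSW, and all names and shapes below are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    memory_embeddings = rng.normal(size=(10_000, 128))   # past latent states
    memory_labels = rng.integers(0, 5, size=10_000)      # e.g. outcomes/actions seen back then

    def recall(query, k=5):
        dists = np.linalg.norm(memory_embeddings - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return memory_labels[nearest]

    print(recall(rng.normal(size=128)))   # labels of the k most similar remembered states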
Can you clarify my understanding as a layman please?
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
Yes that's right.
The JEPA models give me hope that the future isn't just more tokens, more context, and more chain-of-thought.
Does someone know how the "semantic" embeddings are learned? That seems like perhaps the main technical challenge here.
From the paper, section 2.1: minimize_θ,φ,Δ ||P_φ(Δ, E_θ(x)) - sg(E_θ'(y))||_1
where
y - full video, x - masked video, E_θ(.) - learned encoder (semantic embedding), P_φ(.) - learned predictor, Δ - learned mask tokens (indicating which patches in the video were dropped), sg(.) - stop-gradient to prevent gradient propagation into E_θ'(.), which in turn is an exponential moving average of E_θ(.), i.e. θ'_new <- τ θ'_old + (1-τ) θ. So the loss is applied only to the predictions of the masked patches, while the encoder of the full video follows the learned one. This asymmetry in learning prevents collapse of the encoder to a trivial constant.
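A minimal PyTorch-flavoured sketch of that objective (module internals, shapes, and the patch/mask handling are placeholders, not the actual V-JEPA 2 code; here `video` stands for already-patchified tokens of shape (B, N, D) and `mask` is a boolean (B, N) tensor):

    import torch
    import torch.nn.functional as F

    def jepa_loss(encoder, target_encoder, predictor, video, mask):
        x = video * (~mask).unsqueeze(-1).float()     # masked video: context patches only
        z_ctx = encoder(x)                            # E_theta(x)
        with torch.no_grad():                         # sg(.): no gradient into the target
            z_tgt = target_encoder(video)             # E_theta'(y) on the full video
        pred = predictor(z_ctx, mask)                 # P_phi(Delta, E_theta(x))
        return F.l1_loss(pred[mask], z_tgt[mask])     # L1 on the masked patches only

    @torch.no_grad()
    def ema_update(encoder, target_encoder, tau=0.999):
        # theta'_new <- tau * theta'_old + (1 - tau) * theta
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(tau).add_(p, alpha=1 - tau)

The no_grad block plus the EMA update is the asymmetry described above: gradients only flow through the context encoder and predictor, while the target encoder just trails the learned one.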
You have to wonder if the model is going to end up recreating Verlet integration in there somewhere, or if it's generating a pile of those optical-acceleration-cancellation-type heuristics in neural net form.
It's one of those ideas I've had kicking around for a while: if you fused decent object tracking with an understanding of Verlet integration, you should, in principle, be able to start measuring all sorts of physical quantities quite easily.
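A toy version of that idea: given positions sampled by a tracker at a fixed dt, the finite differences Verlet integration is built from already recover velocity and acceleration (the dropped object and the 60 fps rate here are assumptions for illustration):

    import numpy as np

    dt = 1 / 60                                   # assume a 60 fps tracker
    t = np.arange(0, 1, dt)
    x = 0.5 * 9.81 * t ** 2                       # tracked vertical position of a dropped object

    v = (x[2:] - x[:-2]) / (2 * dt)               # central-difference velocity
    a = (x[2:] - 2 * x[1:-1] + x[:-2]) / dt ** 2  # central-difference acceleration

    print(a.mean())    # ~9.81: gravitational acceleration recovered from positions alone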
The robot arm demonstration video jumps at the 00:28 mark...
> That kind of physical intuition isn’t something adults obtain after years of education—young children develop this intuition by observing the world around them before they can even speak in full sentences.
I mean, it still takes them much more time than it takes to train even the largest LLMs we use (a couple months)
In wall-clock time. If you count input tokens/pixels, humans learn with orders of magnitude less input data.
Humans do not start as blank models, they have billions of years of pretraining from evolution.
That's not true at all; the amount of audiovisual data a human is exposed to in even just one year is incredibly vast. Sixty frames per second for sixteen hours per day gives over a billion frames per year, and each frame at such a high resolution would be hundreds of tokens, so hundreds of billions of visual tokens per year.
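The rough arithmetic behind that estimate (the waking hours and tokens-per-frame figures are assumptions):

    frames_per_year = 60 * 60 * 60 * 16 * 365   # fps * sec/min * min/hour * hours/day * days
    print(f"{frames_per_year:,}")                # ~1.26 billion frames
    print(f"{frames_per_year * 300:,}")          # at a few hundred tokens per frame:
                                                 # several hundred billion "tokens" per year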
But they use way less energy for it.
Throw ARC-AGI 2 at it!
I suspect it wouldn't help too much. This model is meant for physics-based world modeling, while nearly all the problems in ARC are symbolic reasoning.
I'd say world modeling can provide the foundations from which symbolic reasoning can emerge; after all, this is how we (humans) learn it too. There are a lot of tasks in ARC that are grounded in simple physics.
Yes, ARC-AGI 2 seems to have a lot of challenges that involve (a projection of) gravity and collisions, so I'd be quite interested to see whether it would generalize.
I imagine that Russian-speaking team members had fun with naming the model V-JEPA
For the curious: "жопа" (which "JEPA" sounds like) means "ass" in Russian. Also, V ("В") means "in" (although if we get into specifics, the grammatical case would require "жопу" or "жопе" depending on the context).
Also the video thumbnail:
J.E.P.A.
Why is Meta investing into this research? What's the potential payoff?
Like others have said, it's an interesting avenue for AGI. The joint embeddings would be closer to thinking than the current LLM token work. LLMs look like they have a lot of limitations for AGI (although who knows if we have another crazy scale-up? But that extra scale is looking difficult right now).
There is a world of money in AGI, and they have the resources, and notably the data, to achieve it.
The goal is a Large Phenomenological Model.
A good definition of "real AGI" might be, a multimodal model which understands time-based media, space, and object behavior, and hence true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Seem to our senses, not understood through language.
LLMs of course trade in language tokens.
We can extend their behavior with front ends which convert other media types into such tokens.
But we can do better with multimodal models which are trained directly on other inputs. E.g. integrating image classifiers with language models architecturally.
With those one can sort of understand time-based media, by sampling a stream and getting e.g. transcripts.
But again, it's even better to build time-based multimodal models, which directly ingest time-based media rather than sampling it. (Other architectures than transformers are going to be required to do this well IMO...)
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: models trained to interact with the world implicitly have a "self model" which interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph, but also how such things look, and sound, and move, and "feel" (how much mass they have, how they move, etc.)...
...well, that is approaching indistinguishable from one of us, at least wrt embodiment and agency.
Possibly, with their investment into AR/VR and gaming, they may see a pathway to creating 'physical intelligence' and tapping into a much bigger untapped market. I mean, isn't Robotaxi the main carrot Musk's been holding in front of Tesla investors for a decade or so? Physical robots may provide a more 'incremental, fault-tolerant' path to applications of AI.
Physical robots as impressive as LLMs?
Robots that can do anything.
physical robots arguing endlessly with physical people
"World model" and "physical reasoning" is such a lie.
Those models don't have any understanding of physics; they just regurgitate what they see in their vision-based training set, just like any image or video generation model does.
Monkey see other monkey cannot go through wall, monkey don't try go through wall.
I think you are misinterpreting the terminology.
Of course these models don't understand physics in the way a physicist or a mathematician would. But they do form a model of the world that can be used for forecasting and reasoning, possibly not unlike how humans and other animals operate when interacting with the physical world.
You don't need to have taken a single physics class to be good at pool...
> Monkey see other monkey cannot go through wall, monkey don't try go through wall.
I mean... we are just monkeys. Did we not learn this way when we were younger?
Agreed! A really young child has no notion of "physics". They are learning through experience and observation.
These models/robots aren't superintelligent by any means, but "Monkey see other monkey cannot go through wall, monkey don't try go through wall" isn't far off from how some animals/humans "learn".
Physics is phenomenological. The model sees phenomena.
Leadership at Meta is dropping the ball with these non-LLM AI model side quests.
LLMs were once a side quest. I hope Meta invests more in alternatives, as maybe we'll find something better. If not, then Meta just loses a bit of R&D budget. They are still heavily invested in regular LLM development, so it's not like they are trading one for the other.
I strongly agree. FAANG has the money to do the research. LLMs are far from intelligent - AGI will require a number of other advances.
AI research is more than just LLMs.
Is this a sarcastic compliment? Diversity in research agendas is very important for pushing the frontier forward, even if it's not good for the company investing in the high-risk research. Good job, to an otherwise toxic company.