Comment by TheAceOfHearts

6 days ago

> With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.

How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a minimum 20% failure rate sounds pretty bad? The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to 80% is a significant improvement. If I'm reading the paper correctly, the time required to compute and execute each action also dropped from about 4 minutes to 16 seconds, which seems significant as well.
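A quick back-of-envelope comparison of those two reported numbers, purely illustrative: it assumes one planning step per pick-and-place attempt (the paper's "16 seconds per action" may mean a task takes several actions) and continuous operation.

```python
# Illustrative only: treats each attempt as a single planned action and
# ignores resets, which likely overstates real-world throughput.

def successes_per_hour(seconds_per_attempt: float, success_rate: float) -> float:
    """Expected successful pick-and-places per hour of continuous attempts."""
    attempts = 3600 / seconds_per_attempt
    return attempts * success_rate

old = successes_per_hour(240, 0.15)  # ~4 min per action, ~15% success
new = successes_per_hour(16, 0.80)   # 16 s per action, 80% success

print(old)        # 2.25 successes/hour
print(new)        # 180.0 successes/hour
print(new / old)  # 80.0 -- roughly an 80x throughput gain under these assumptions
```

Under these (generous) assumptions, the speed and reliability gains compound into roughly an 80x throughput difference, which is why both numbers matter together.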

Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:

> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.

I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.

Currently, you train a VLA (vision-language-action) model for a specific pair of robotic arms, for a specific task. The end-actuator actions are embedded in the model itself. So say you train a pair of arms to pick up an apple: you cannot zero-shot it into picking up a glass. What you see in demos is the result of lots of training and fine-tuning (few-shot) on specific object types and with specific robotic arms or bodies.

The language-intermediary embedding brings some generalisation to the table, but not much. The vision -> language -> action translation is, how do I put this, brittle at best.

What these guys are showing is a zero-shot approach to new tasks in new environments with 80% accuracy. This is a big deal. Pi0 from Physical Intelligence is probably the best model to compare against.

It’s important to keep some perspective: there are zero robots in the wild, at the moment, that use a world model to work on tasks they weren’t specifically trained on. This is cutting edge research and an 80% success rate is astonishing!

  • 80% success rate is also potentially commercially viable if the task is currently being done by a human.

    Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.

    • This really depends on the failure modes. In general, humans fail in predictable, and mostly safe, ways. AIs fail in highly unpredictable and potentially very dangerous ways. (A human might accidentally drop a knife, an AI might accidentally stab you with it.)

      1 reply →

    • It might still be a little slow (I'm not sure if the 16 seconds to compute an action is fast enough for commercial use cases), but this is definitely exciting and seems like a great step forward.

  • I'm surprised that's not how it's already done. I'd have figured some of the inner layers in LLMs were already "world models", and that it's the outer layers that differentiated models between text vs. images/robotics/other modes...

    • That's what the propaganda says, but when we keep explaining that it isn't true, an army arrives to repeat ad copy from their favourite tech guru.

      All statistical models of the kind in use are interpolations through historical data -- there's no magic. So when you interpolate through historical texts, your model is of historical text.

      Text is not a measurement of the world: to say "the sky is blue" is not even reliably associated with the blueness of the sky -- let alone with the fact that the sky isn't blue (there is no "sky", and the atmosphere isn't blue).

      These models appear to "capture more" only because, when you interpret the text, you attribute meaning/understanding to it as the cause of its generation -- but that wasn't the cause, so this is necessarily an illusion. There is no model of the world in a model of historical text -- there is a model of the world in your head which you associate with text, and that association is exploited when you use LLMs to do more than mere syntax transformation.

      LLMs excel most at "fuzzy retrieval" and things like coding -- the latter is principally a matter of syntax, and the former of recollection. As soon as you require the prompt completion to maintain "semantic integrity" under non-syntactical/retrievable constraints, it falls apart.

      11 replies →

  • I can buy this, given a very wide meaning of "specifically trained on" and some handwaving about "as far as I know", but then I read the actual wording of "new objects in new and unseen environments", and remember these were floating around Mountain View doing tasks involving new objects in novel environments years ago. Then I kinda have to give up and admit to myself that I'm distorting the conversation by emphasizing positivity over ground truth.

  • They don’t use it because it’s unsafe and potentially life-threatening, lol

    • Plenty of things are unsafe and potentially life-threatening, including machines with pre-programmed routines that we use today. We already have robots with limited intelligence interacting safely with humans in workplaces.

      This learning technology didn't exist until this moment in time. That probably has more to do with why no one is using it in the wild.

      1 reply →
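The "10 humans vs. 10 robots + 2 humans" cost argument a few comments up can be sketched as back-of-envelope arithmetic. All cost figures here are made-up placeholders, not data from the paper or from any robotics vendor.

```python
# Hypothetical staffing-cost sketch for the commercial-viability argument.
# human_cost and robot_cost are assumed placeholder values.

def staffing_cost(humans: int, robots: int,
                  human_cost: float = 50_000.0,    # assumed annual cost per human
                  robot_cost: float = 20_000.0) -> float:  # assumed annualized cost per robot
    """Total annual cost of a mixed human/robot workforce."""
    return humans * human_cost + robots * robot_cost

all_human = staffing_cost(humans=10, robots=0)
hybrid = staffing_cost(humans=2, robots=10)  # 2 humans cover the ~20% failure cases

print(all_human)  # 500000.0
print(hybrid)     # 300000.0
```

The hybrid setup only wins when the annualized robot cost is low enough; with these placeholder numbers the break-even robot cost is (10 - 2) x 50,000 / 10 = 40,000 per robot per year.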

I run thousands of robots in production. We can get a very high success rate, but only for the task they're designed for. Production robots can't yet pick up stuff they drop. And this "80%" level is not actually acceptable, or even state of the art, for plain pick-and-place, but it's compelling for a robot that also knows how to do other things with equal quality (if JEPA does that).

Yeah, I also wonder how old-school approaches using machine vision, inverse kinematics, and hand-coded algorithms would compare -- or perhaps some hybrid method?

Your comment is not aligned with how science is done. In discovery work you necessarily start with limited approaches, and you certainly don't know in advance whether there is a "clear trajectory".