Comment by deepGem
6 days ago
Currently,
You train a VLA (vision-language-action) model for a specific pair of robotic arms and a specific task. The end actuator actions are embedded in the model's outputs. So say you train a pair of arms to pick up an apple: you cannot zero-shot it to pick up a glass. What you see in demos is the result of lots of training and fine-tuning (few-shot) on specific object types and with specific robotic arms or bodies.
The language intermediary embedding brings some generalisation to the table, but not much. The vision -> language -> action translation is, how do I put this, brittle at best.
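To make the embodiment coupling concrete, here's a minimal sketch, not any particular lab's architecture; the class, the encoders, and the dimensions (e.g. 14 DoF for a two-arm setup) are illustrative assumptions. The point it shows: the action head's output dimension is the action space of the one robot it was trained on, which is why the same checkpoint can't drive a different body zero-shot.

```python
import torch
import torch.nn as nn


class VLAPolicy(nn.Module):
    """Toy vision-language-action policy (illustrative only).

    A real VLA would use a pretrained ViT and an LLM backbone; here plain
    linear layers stand in for them. The action head is tied to one
    embodiment: its output size is that robot's joint/gripper count.
    """

    def __init__(self, vision_dim=768, lang_dim=768, action_dim=14):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, 512)   # stand-in for a vision backbone
        self.lang_encoder = nn.Linear(lang_dim, 512)       # stand-in for a language backbone
        self.action_head = nn.Linear(1024, action_dim)     # embodiment-specific output

    def forward(self, image_feat, instruction_feat):
        fused = torch.cat(
            [self.vision_encoder(image_feat), self.lang_encoder(instruction_feat)],
            dim=-1,
        )
        # Joint targets in *this* arm pair's action space only.
        return self.action_head(fused)


# Usage: inputs are pre-computed features for the camera frame and the instruction.
policy = VLAPolicy()
image_feat = torch.randn(1, 768)        # e.g. features of "apple on a table"
instruction_feat = torch.randn(1, 768)  # e.g. embedding of "pick up the apple"
action = policy(image_feat, instruction_feat)  # shape (1, 14)
```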
What these guys are showing is a zero-shot approach to new tasks in new environments at 80% accuracy. This is a big deal. Pi0 from Physical Intelligence is the best model to compare against, I think.