Comment by ygouzerh

10 months ago

The rate of progress on multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at that time... 8 months later, on tasks like "Pick Place Hotdog Sausage", the success rate has gone from 2/10 to 6/10.

"Pick Place Hotdog Sausage" is such a bizarre name, though. Is it meant to be human-readable? AI-readable? Just a label for the researchers? Same with "Put Mushroom Place Pot". As far as I can see, both labels are used only in this Magma paper and nowhere else that Google can find.

  • "Pick & place" is a term for a kind of robot that can pick up scattered items from a conveyor belt and arrange them in a regular fashion.

    The really fast multi-arm versions can be hypnotic to watch. You can see an example at 1:00 in this video: https://youtu.be/aPTd8XDZOEk

    The limitation of industrial pick & place robots is that they're configured for a single task, and reconfiguring them for a new product is notoriously expensive.

    Magma's "pick & place" demo is much slower and shakier than a specialized industrial robot. But Magma can apparently be adapted to a new task by providing plain English instructions.