
Comment by plipt

2 days ago

The demo is quite interesting, but I am mostly intrigued by the claim that it is running totally locally on each robot. It seems to use some agentic decision making, but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?

The article mentions that the system in each robot uses two AI models.

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot.

What part of this system understands 3 dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech-to-text, visual identification, decision making, motor control, multi-robot coordination, and navigation of 3D space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

It looks pretty obvious (I think):

1. S2 is a 7B VLM. It is responsible for taking in the camera streams (from however many cameras there are), running them through prompt-guided text generation, and, right before the lm_head (or a few layers leading to it), directly taking the latent encoding;

2. S1 is where they collected a few hundred hours of teleoperation data, retrospectively came up with prompts for 1, then trained from scratch;

Whether S2 is fine-tuned together with S1 is an open question; at minimum there is an MLP adapter that is fine-tuned, but the whole 7B VLM could be fine-tuned too.
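To make that speculation concrete, here is a rough PyTorch sketch of the wiring being described: pooled hidden states from a big VLM pushed through a small MLP adapter, and a small cross-attention decoder that turns proprioception plus that latent into continuous joint targets. Every shape, the pooling choice, and the 35-dim action vector are guesses for illustration, not Figure's actual design:

```python
# Rough sketch of the hierarchy described above -- NOT Figure's code.
# All shapes, pooling choices, and module sizes are guesses for illustration.
import torch
import torch.nn as nn

class S2LatentHead(nn.Module):
    """Stand-in for the 7B VLM side: assume some backbone already produced
    per-token hidden states, pool them, and map them through the speculated
    MLP adapter into a compact latent (the "take it before lm_head" idea)."""
    def __init__(self, hidden_dim=4096, latent_dim=512):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(hidden_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, vlm_hidden_states):        # (batch, seq, hidden_dim)
        pooled = vlm_hidden_states.mean(dim=1)   # crude pooling, for illustration
        return self.adapter(pooled)              # (batch, latent_dim)

class S1Policy(nn.Module):
    """Stand-in for the ~80M cross-attention encoder-decoder: proprioception
    tokens cross-attend to the S2 latent and decode continuous joint targets."""
    def __init__(self, latent_dim=512, proprio_dim=64, action_dim=35, d_model=256):
        super().__init__()
        self.proprio_in = nn.Linear(proprio_dim, d_model)
        self.latent_in = nn.Linear(latent_dim, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, proprio, s2_latent):       # proprio: (batch, t, proprio_dim)
        tgt = self.proprio_in(proprio)
        memory = self.latent_in(s2_latent).unsqueeze(1)  # (batch, 1, d_model)
        hidden = self.decoder(tgt, memory)               # cross-attend to the latent
        return self.action_out(hidden)                   # (batch, t, action_dim)

if __name__ == "__main__":
    s2, s1 = S2LatentHead(), S1Policy()
    fake_vlm_states = torch.randn(1, 128, 4096)   # would come from the real VLM
    latent = s2(fake_vlm_states)
    actions = s1(torch.randn(1, 10, 64), latent)
    print(actions.shape)                          # torch.Size([1, 10, 35])
```

If something like this is the setup, the open question above is just whether gradients from S1's imitation loss stop at the adapter or flow all the way back through the 7B VLM.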

It looks plausible, but I am still skeptical about the generalization claim, given that it is all fine-tuned on household tasks. But nowadays it is really difficult to understand how these models generalize.

I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated pointcloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

It appears that the robot is being fed plain English instructions, just as any VLM would be -- instead of the very common `text+av => text` paradigm (classifiers, perception models, etc.), or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`.

Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think it is pretty clearly doable with existing AI techniques (/a loop).
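For what it's worth, that loop could be as simple as the sketch below -- a planner (an LLM, a script, or a person typing) handing plain-English instructions to a `text+images => movements` policy. Every name and number here is a made-up placeholder, not anything Figure has published:

```python
# Hypothetical outer loop for the "text+av => movements" idea above.
# toy_planner and toy_vla_policy are placeholders, not Figure's API.
import time

def toy_planner(step):
    # In the demo this role could be a human or an LLM; here it's a fixed script.
    script = ["pick up the cookies", "hand them to the robot on your left"]
    return script[min(step, len(script) - 1)]

def toy_vla_policy(instruction, frames):
    # Stand-in for the S2+S1 stack: text + camera frames in, joint targets out.
    return [0.0] * 35  # pretend vector of joint/hand targets

def control_loop(steps=5, hz=7.0):
    for step in range(steps):
        instruction = toy_planner(step)
        frames = []                                # would be the robot's camera images
        actions = toy_vla_policy(instruction, frames)
        print(f"{instruction!r} -> {len(actions)} joint targets")
        time.sleep(1.0 / hz)

control_loop()
```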

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

If your question is "where are the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware. Each period (7Hz?), two sets of instructions are generated.

[1] https://www.figure.ai/ai
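On the "each period (7Hz?)" bit: whichever box the models run in, the plumbing plausibly looks like a slow loop refreshing one latent (one "set of instructions") per robot, while each robot's faster low-level loop keeps acting on whatever latent it last received. The rates, the shared forward pass, and all the names in this toy are my assumptions, purely to illustrate the timing:

```python
# Toy dual-rate split, with made-up rates and names -- not Figure's code.
import threading
import time

latents = {"robot_left": None, "robot_right": None}
lock = threading.Lock()

def slow_s2_loop(hz=7.0):
    """Pretend S2 pass: each period, refresh one latent per robot."""
    while True:
        with lock:
            for name in latents:
                latents[name] = f"latent@{time.monotonic():.2f}"  # placeholder
        time.sleep(1.0 / hz)

def fast_s1_loop(name, hz=50.0):
    """Pretend S1 pass: decode actions from the latest latent, much more
    often than that latent is refreshed."""
    for _ in range(5):  # a few iterations so the demo terminates
        with lock:
            latent = latents[name]
        print(name, "acting on", latent)
        time.sleep(1.0 / hz)

threading.Thread(target=slow_s2_loop, daemon=True).start()
time.sleep(0.2)  # let the first latents land
fast_s1_loop("robot_left")
fast_s1_loop("robot_right")
```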

  What possible combo of model types are they stringing together? Or is this something novel?

Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundational models, and I wouldn't describe anything here as "totally novel" on the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulties of getting this set up perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...

    their "AI" marketing page[1] pretty clearly implies that compute is offloaded

I think that answers most of my questions.

I am also not in robotics, so this demo does seem quite impressive to me, but I think they could have been clearer on exactly what technologies they are demonstrating. Overall still very cool.

Thanks for your reply.