Comment by bglazer

13 hours ago

I genuinely did not expect to see a robot handling clothing like this for at least another ten years. Insanely impressive.

I do find it interesting that they state that each task is done with a fine-tuned model. I wonder if that's a limitation of the current dataset their foundation model is trained on (which is what I think they're suggesting in the post) or if it reflects something more fundamental about robotics tasks. It reminds me of LLMs a few years ago, when fine-tuning was more prevalent. I don't follow LLM training methodology closely, but my impression is that the bulk of recent improvements have come from better RL post-training and inference-time reasoning.

Obviously they're pursuing RL, and I'm not sure spending more tokens at inference would even help for fine manipulation like this, to say nothing of the latency problems it would introduce.

So maybe the need for fine-tuning goes away with a better foundation model, as they're suggesting? I hope this doesn't point towards more fundamental limitations on robotics learning with the current VLA foundation model architectures.

Looks like we may actually have robot maids picking stuff up before too long!

Re: not expecting it for at least ten years, current progress is pretty much in line with Moravec's predictions from 35 years ago (https://jetpress.org/volume1/moravec.htm).

I wonder if he still follows this stuff?

  • > robot maids

    What fascinates me is that we could probably make self-folding clothes. We also already have wrinkle-free clothes that barely need folding. I wager we could go a lot further if we invested a bit more in the problem.

    But the first image people seem to have of super-advanced, multi-thousand-dollar robots is still folding the laundry.

    • I think it's just one of the most obvious things that Rosey from the Jetsons could do but current robots can't.

There are a lot of indications that robotics AI is in a data-starved regime, which means future models are likely to attain better zero-shot performance, solve more issues in-context, generalize better, require less task-specific training, and be more robust.

But it seems like a degree of "RL in real life" is nigh-inevitable; imitation learning only gets you so far. Kind of like how RLVR is nigh-inevitable for high LLM performance on agentic tasks, and for many of the same reasons.

To be clear, the video at the top of the article is at 4x speed, and the clothes-folding section is full of cuts.