Comment by lelag

2 days ago

Some more thoughts about training a manipulation model: I would add that synthetic data might be key to making it happen.

One issue is that most video is not shot in first person, so it might make a poor dataset for the agentic part, assuming the robot has human-like vision.

Still, if you have a large dataset of motion-capture data with reasonably accurate finger movement, you could use a video diffusion model with a ControlNet to get a realistic-looking first-person video of a specific motion (see the first sketch below). Another way would be to use a model like DUSt3R to reconstruct a geometric 3D scene from the initial video, which would let you move the camera to match a first-person view (second sketch).
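To make the ControlNet route concrete, here is a minimal sketch, assuming the mocap skeleton has already been rendered into per-frame pose images from an egocentric virtual camera (the files under poses/ are assumed inputs, not produced here). It drives an OpenPose-conditioned ControlNet frame by frame with Hugging Face diffusers; a real pipeline would use a proper video model for temporal consistency.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Real checkpoints: an OpenPose-conditioned ControlNet on top of SD 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Assumed inputs: one skeleton render per mocap frame, drawn from a
# head-mounted (egocentric) viewpoint.
pose_frames = [Image.open(f"poses/frame_{i:04d}.png") for i in range(120)]

frames = []
for pose in pose_frames:
    out = pipe(
        prompt="first-person view, hands picking up a mug on a kitchen counter",
        image=pose,  # ControlNet conditioning image (the rendered skeleton)
        num_inference_steps=20,
    )
    frames.append(out.images[0])  # one synthetic video frame per mocap frame
```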
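For the DUSt3R route, here is a sketch of just the viewpoint change, assuming the reconstruction step already produced a colored point cloud per frame plus an estimated head pose (the toy arrays below stand in for that output). Re-rendering from a first-person view is then a plain pinhole projection:

```python
import numpy as np

def project_to_view(points, colors, R, t, fx, fy, cx, cy, w, h):
    """Render a colored point cloud into a virtual pinhole camera."""
    cam = points @ R.T + t                  # world -> camera coordinates
    front = cam[:, 2] > 1e-6                # keep points in front of the camera
    cam, colors = cam[front], colors[front]
    u = (fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = (fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img = np.zeros((h, w, 3), dtype=np.uint8)
    order = np.argsort(-cam[ok, 2])         # far-to-near, so near points win
    img[v[ok][order], u[ok][order]] = colors[ok][order]
    return img

# Toy stand-ins for one frame of a DUSt3R-style reconstruction.
points = np.random.uniform(-1, 1, (10_000, 3)) + np.array([0.0, 0.0, 3.0])
colors = np.random.randint(0, 255, (10_000, 3), dtype=np.uint8)
R, t = np.eye(3), np.zeros(3)               # virtual first-person camera pose
frame = project_to_view(points, colors, R, t, 500, 500, 320, 240, 640, 480)
```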

Either way, the resulting first-person video could serve as the training dataset for the agentic model.

Now, maybe human-like vision is not even necessary: unlike a human, there is nothing preventing your robot from seeing through external cameras placed around the house. Hell, there's even a good chance your robot's brain will live in a datacenter hundreds of miles away.