Comment by bhouston

4 months ago

When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling in the top-level model, which then gets turned into motion control parameters via inverse kinematics controllers?

What is the interface from the top level to the motors?

I feel it cannot just be a neural network all the way down, right?

4 comments


imtringued 4 months ago

You don't use function calling. You specifically train the neural network to directly encode the robot action as a token. There are many ways to do this: you can output absolute positions, delta positions, or relative trajectories, and you can do so in joint space or end-effector space.
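To make "encode the robot action as a token" concrete, here is a minimal sketch of one common scheme: uniform binning of a normalized delta action. The bin count, the action range, and the function names are assumptions for illustration, not details of any particular robot or model.

```python
import numpy as np

N_BINS = 256            # assumed number of tokens per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def encode_action(action):
    """Map continuous action values to integer token ids via uniform binning."""
    a = np.clip(np.asarray(action, dtype=np.float64), LOW, HIGH)
    scaled = (a - LOW) / (HIGH - LOW)  # now in [0, 1]
    # The upper edge (scaled == 1.0) would land in bin N_BINS, so cap it.
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)

def decode_action(tokens):
    """Map token ids back to the centre of their bin."""
    t = np.asarray(tokens, dtype=np.float64)
    return LOW + (t + 0.5) / N_BINS * (HIGH - LOW)

# Hypothetical delta joint positions, already normalized to [-1, 1].
delta_joint = np.array([0.10, -0.25, 0.0])
tokens = encode_action(delta_joint)
recovered = decode_action(tokens)
```

The round trip loses at most half a bin width per dimension, which is the usual trade-off between vocabulary size and action resolution.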

200Hz is barely enough to control a motor, but it is good enough to send a reference signal to a motor controller. Usually you have a neural network that learns complex high-level behaviour and produces a high-level trajectory, then a whole-body robot controller based on quadratic programming that does things like balancing and maintaining contacts when holding objects or pressing against things. This requires a model of the robot dynamics so that you know the relationship between torques and accelerations. After that you need a motor controller that accepts reference acceleration/torque, velocity and position commands, which it then turns into 10kHz to 100kHz pulse-width-modulated signals. The motor controller itself is driving MOSFETs, so it can only turn them on or off, unless you are using expensive sinusoidal drivers.
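The layering can be sketched for a single joint. This is a heavily simplified stand-in, not a whole-body QP controller: a 200Hz reference is held between updates, a 1kHz PD loop converts it to a desired acceleration, the dynamics model (here just torque = inertia × acceleration) converts that to torque, and the torque is mapped to a signed PWM duty cycle. All constants and names are assumed for illustration.

```python
INERTIA = 0.05          # kg*m^2, assumed joint inertia (the "dynamics model")
KP, KD = 400.0, 40.0    # assumed PD gains (critically damped: KD = 2*sqrt(KP))
TAU_MAX = 5.0           # N*m, assumed torque at 100% duty cycle
REF_HZ, CTRL_HZ = 200, 1000

def reference(t):
    """Stand-in for the high-level trajectory: a step to 0.5 rad."""
    return 0.5

def torque_command(q_ref, q, qd):
    """PD law gives a desired acceleration; the model turns it into torque."""
    qdd_des = KP * (q_ref - q) - KD * qd
    return INERTIA * qdd_des

def duty_cycle(tau):
    """Signed PWM duty in [-1, 1]; the drive saturates beyond TAU_MAX."""
    return max(-1.0, min(1.0, tau / TAU_MAX))

def simulate(seconds=1.0):
    q, qd, q_ref = 0.0, 0.0, 0.0
    dt = 1.0 / CTRL_HZ
    for step in range(int(seconds * CTRL_HZ)):
        # The reference arrives at 200Hz and is held in between.
        if step % (CTRL_HZ // REF_HZ) == 0:
            q_ref = reference(step * dt)
        duty = duty_cycle(torque_command(q_ref, q, qd))
        # Idealized motor: applied torque is proportional to duty cycle.
        qdd = duty * TAU_MAX / INERTIA
        qd += qdd * dt
        q += qd * dt
    return q

final_position = simulate()
```

In a real drive the duty cycle would be realized by switching the MOSFETs at the PWM frequency; the simulation above only models its average effect.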

Philpax 4 months ago

Have a look at the post - it explains how it works. There are two models: a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model. The former produces a latent vector, which is then interpreted by the latter to drive the motors.
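As a rough sketch of that two-rate structure: a slow model refreshes a latent vector a few times per second, and a fast policy consumes the most recent latent on every 200Hz tick. Both functions below are trivial stand-ins, not the models from the post, and the lock-step scheduling is an assumption made to keep the example deterministic; a real system would more likely run the two models asynchronously.

```python
FAST_HZ = 200                           # fast visuomotor loop rate
SLOW_HZ = 8                             # assumed value within the 7-9Hz range
TICKS_PER_LATENT = FAST_HZ // SLOW_HZ   # 25 fast ticks per latent update

def slow_model(observation):
    """Stand-in for the large vision-language model: returns a latent vector."""
    return [float(observation), 1.0]

def fast_policy(latent, joint_state):
    """Stand-in for the small visuomotor model: latent + state -> action."""
    return 0.01 * latent[0] - 0.1 * joint_state

def run(n_ticks):
    latent = None
    slow_calls = 0
    joint_state = 0.0
    actions = []
    for tick in range(n_ticks):
        # Refresh the latent only every TICKS_PER_LATENT fast ticks.
        if tick % TICKS_PER_LATENT == 0:
            latent = slow_model(tick)
            slow_calls += 1
        # The fast loop always acts on the most recent latent.
        action = fast_policy(latent, joint_state)
        joint_state += action
        actions.append(action)
    return slow_calls, actions

slow_calls, actions = run(FAST_HZ)  # one simulated second
```

Over one simulated second the slow model runs 8 times while the fast policy emits 200 actions.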

NitpickLawyer 4 months ago
> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.
Huh, an interesting approach. I wonder if something like this could be used for other things as well, like "computer use", with the same concept: a "large" model handling the goals and a "small" model handling clicking and so on at much higher rates. That would be useful for games and things like that.
- whatever1 4 months ago
  
  This is typical in real-time applications. A supervisor tries to guess which region the system is currently in and then invokes the correct set of lower-level algorithms.
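The supervisor pattern can be sketched in a few lines. The regions, threshold, and controllers here are hypothetical and chosen only to show the dispatch structure.

```python
NEAR_THRESHOLD = 0.1  # assumed boundary between the two operating regions

def coarse_controller(error):
    """Aggressive, saturated command for when the target is far away."""
    return max(-1.0, min(1.0, 2.0 * error))

def fine_controller(error):
    """Gentle command for when the target is close."""
    return 0.2 * error

CONTROLLERS = {"far": coarse_controller, "near": fine_controller}

def supervisor(error):
    """Guess which operating region the system is currently in."""
    return "far" if abs(error) > NEAR_THRESHOLD else "near"

def step(error):
    """Pick the region, then invoke the matching lower-level algorithm."""
    region = supervisor(error)
    return region, CONTROLLERS[region](error)

far_result = step(1.0)
near_result = step(0.05)
```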