Comment by bilsbie
2 days ago
I get the impression there's a language model sending high-level commands to a separate control model. I wonder when we'll have one multimodal model that controls everything.
The latest models seem to be fluidly tied in with voice generation, even singing and laughing.
It seems like it would be possible to train a multimodal model that does the same with low-level actuator commands.
If you read the article, they describe a two-system approach: one "think fast" 80M-parameter model running at 200 Hz to control motion, and one "think slow" 7B-parameter model running at ~7-9 Hz for everything else (scene understanding, language processing, etc.).
If that sounds like a cheat, neuroscientists tell us this is how the human brain works.
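A minimal sketch of how that split might look in code, assuming the slow model emits a latent command that the fast policy consumes on every control tick. The class names, the 512-dim latent, and the camera/robot interfaces are illustrative assumptions, not the article's actual implementation:

    # Hypothetical two-rate control loop: a slow planner (~8 Hz) conditions
    # a fast policy (200 Hz). All names and shapes here are assumptions.
    import time

    SLOW_HZ = 8     # "think slow" ~7B model: scene understanding, language
    FAST_HZ = 200   # "think fast" ~80M model: low-level motion control

    class SlowPlanner:
        """Stand-in for the large vision-language model."""
        def plan(self, image, instruction):
            # Returns a latent command vector that conditions the fast policy.
            return [0.0] * 512

    class FastPolicy:
        """Stand-in for the small high-rate control policy."""
        def act(self, proprioception, latent_command):
            # Maps robot state + latent command to actuator targets.
            return [0.0] * 24  # e.g. joint position targets

    def control_loop(camera, robot, instruction):
        planner, policy = SlowPlanner(), FastPolicy()
        latent = [0.0] * 512
        next_plan_time = 0.0
        while True:
            now = time.monotonic()
            if now >= next_plan_time:                     # ~8 Hz replanning
                latent = planner.plan(camera.read(), instruction)
                next_plan_time = now + 1.0 / SLOW_HZ
            action = policy.act(robot.state(), latent)    # 200 Hz control
            robot.send(action)
            time.sleep(1.0 / FAST_HZ)

The point of the split is that the fast loop never waits on the slow model; it just keeps acting on the most recent latent command until a new one arrives.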