Comment by bilsbie
2 days ago
I get the impression there's a language model sending high-level commands to a separate control model. I wonder when we'll have one multimodal model that controls everything.
The latest models seem to be fluidly tied in with voice generation, even singing and laughing.
It seems like it would be possible to train a multimodal model that does the same with low-level actuator commands.
If you read the article, they describe a two-system approach: one "think fast" 80M-parameter model running at 200 Hz to control motion, and one "think slow" 7B-parameter model running at ~7-9 Hz for everything else (scene understanding, language processing, etc.).
If that sounds like a cheat, neuroscientists tell us this is how the human brain works.
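A minimal sketch of how that split might look in code, assuming the slow model emits a latent command that the fast policy consumes on every control tick. The class names, the 512-dim latent, and the camera/robot interfaces are illustrative assumptions, not the article's actual implementation:

    # Hypothetical two-rate control loop: a slow planner (~8 Hz) conditions
    # a fast policy (200 Hz). All names and shapes here are assumptions.
    import time

    SLOW_HZ = 8     # "think slow" ~7B model: scene understanding, language
    FAST_HZ = 200   # "think fast" ~80M model: low-level motion control

    class SlowPlanner:
        """Stand-in for the large vision-language model."""
        def plan(self, image, instruction):
            # Returns a latent command vector that conditions the fast policy.
            return [0.0] * 512

    class FastPolicy:
        """Stand-in for the small high-rate control policy."""
        def act(self, proprioception, latent_command):
            # Maps robot state + latent command to actuator targets.
            return [0.0] * 24  # e.g. joint position targets

    def control_loop(camera, robot, instruction):
        planner, policy = SlowPlanner(), FastPolicy()
        latent = [0.0] * 512
        next_plan_time = 0.0
        while True:
            now = time.monotonic()
            if now >= next_plan_time:                     # ~8 Hz replanning
                latent = planner.plan(camera.read(), instruction)
                next_plan_time = now + 1.0 / SLOW_HZ
            action = policy.act(robot.state(), latent)    # 200 Hz control
            robot.send(action)
            time.sleep(1.0 / FAST_HZ)

The point of the split is that the fast loop never waits on the slow model; it just keeps acting on the most recent latent command until a new one arrives.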