Comment by famouswaffles
11 hours ago
>Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
>SOTA typically refers to achieving the best performance
Multimodal transformers are the best way to turn plain-text instructions into embodied world behavior. That has nothing to do with being 'trendy'. A Vision-Language-Action model would probably have done much better, but really the only difference between that and the models trialed above is the training data. Same technology.