Comment by avaer
9 hours ago
Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).
But I don't see how it's the wrong tool given the goal.
SOTA typically refers to achieving the best performance, not using the trendiest thing regardless of performance. There is some subtlety here. At some point an LLM might give the best performance in this task, but that day is not today, so an LLM is not SOTA, just trendy. It's kinda like rewriting something in Rust and calling it SOTA because that's the trend right now. Hope that makes sense.
>Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
>SOTA typically refers to achieving the best performance
Multimodal Transformers are the best way to turn plain text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision Language Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.
I don’t think trendy is really the right word, and maybe it’s not state of the art, but a lot of us in the industry are seeing emerging capabilities that might make it SOTA. Hope that makes sense.
LLMs are indeed the definition of trendy (I've found Google Trends is a good entry point for getting a broad sense of whether something is "trendy")! Basically, the right way to think about it is that something can be promising and demonstrate emerging capabilities, but those things don't make it SOTA, nor do they make it trendy. They can be related, though: I expect everything SOTA was once promising and emerging, but not everything promising or emerging became SOTA. It's a subtlety that isn't super easy to grasp, but (and here is one area where I think an LLM can show promise) an LLM like ChatGPT can help unpick the distinctions. Still, it's slightly nuanced and I understand the confusion.