Comment by notepad0x90
13 days ago
LLMs can do chat completion, but they don't only do chat completion. There are LLMs for image generation, voice generation, video generation and possibly more. The drone's camera feeds images to the LLM, which then determines what action to take based on them. It's similar to asking ChatGPT "there is a tree in this picture; if you were operating a drone, what action would you take to avoid a collision?", except the "there is a tree" part is done by the LLM's image recognition, and the system prompt is "recognize objects and avoid collision". Of course I'm simplifying it a lot, but it is essentially generating navigational directions under a visual context using image recognition.
> There are LLMs for image generation,
That part isn’t handled by an LLM
> voice generation,
That part isn’t handled by an LLM
> video generation
That part isn’t handled by an LLM
Yes, it can be, and often is. Advanced voice mode in ChatGPT and the voice mode in Gemini are LLMs. So is the image generation in both ChatGPT and Gemini (Nano Banana).
What is it handled by? I'm honestly curious, since there are models specifically labeled as being for those tasks.