Comment by uxhacker · 1 day ago

Because some LLMs are now multimodal—they can process and generate not just text, but also sound and visuals. In other words, they’re beginning to handle a broader range of human inputs and outputs, much like we do.

Those are not LLMs. They use the same foundational technology (pick what you like, but I'd say transformers) to accomplish tasks that require entirely different training data and architectures.

I was specifically asking about LLMs because the comment I replied to only talked about LLMs - Large Language Models.

  • At this point, calling a multimodal LLM an LLM is pretty uncontroversial. Most of the differences lie in the encoders and in the embedding projections that map other modalities into the model's token-embedding space (a rough sketch of that follows below the thread). If anything, I'd argue MoE models differ more from a basic LLM than a multimodal LLM does from a regular one.

    Bottom line: when folks talk about LLM applications, multimodal LLMs, MoE LLMs, and even agents all fall under the same general umbrella.
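
A minimal sketch of the point about encoders and embedding projections, using PyTorch with made-up dimensions (not any particular model's architecture): the text side of the LLM stays the same, and what gets added is a modality encoder plus a projection that maps its features into the same token-embedding space the transformer already consumes.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, d_model = 32000, 4096        # LLM vocabulary / hidden size
vision_dim, n_patches = 1024, 256        # vision-encoder output features / patches

text_embed = nn.Embedding(vocab_size, d_model)   # ordinary LLM token embeddings
vision_proj = nn.Linear(vision_dim, d_model)     # the extra "multimodal" piece

text_ids = torch.randint(0, vocab_size, (1, 16))     # stand-in text prompt
patch_feats = torch.randn(1, n_patches, vision_dim)  # stand-in ViT patch features

image_tokens = vision_proj(patch_feats)   # (1, 256, 4096) image "tokens"
text_tokens = text_embed(text_ids)        # (1, 16, 4096) text tokens
sequence = torch.cat([image_tokens, text_tokens], dim=1)

# From here on, `sequence` runs through the same decoder-only transformer
# stack a text-only LLM would use.
print(sequence.shape)  # torch.Size([1, 272, 4096])
```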