← Back to context

Comment by embedding-shape

5 days ago

> But language is the input and the vector space within which their knowledge is encoded and stored. The don't have a concept of a duck beyond what others have described the duck as.

I guess if we limit ourselves to "one-modal LLMs" yes, but nowadays we have multimodal ones, who could think of a duck in the way of language, visuals or even audio.

You don’t understand. If humans had no words to describe a duck, they would still know what a duck is. Without words, LLMs would have no way to map an encounter with a duck to anything useful.

  • Which makes sense for text LLMs yes, but what about LLMs that deal with images? How can you tell they wouldn't work without words? It just happens to be words we use for interfacing with them, because it's easy for us to understand, but internally they might be conceptualizing things in a multitude of ways.

    • Multimodal models aren't really multimodal. The images are mapped to words and then the words are expanded upon by a single mode LLM.

      If you didn't know the word "duck", you could still see the duck, hunt the duck, use the ducks feather's for your bedding and eat the duck's meat. You would know it could fly and swim without having to know what either of those actions were called.

      The LLM "sees" a thing, identifies it as a "duck", and then depends on a single modal LLM to tell it anything about ducks.

      1 reply →