Comment by lithocarpus
18 days ago
I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?
But I know very little about this.
18 days ago
You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models. This is the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in an abstract world-model space.
> You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models.
This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.
E.g. take a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images, and an LLM encoder backbone pre-trained in the usual LLM way). Even if you design this architecture so that the embedding vectors produced by each encoder have the same shape, so they can be combined by another component such as a unified transformer, the cosine similarity between an image embedding and a text embedding will not be a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
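A toy sketch of that point, in NumPy. This is not a real VLM. Two independent random linear maps stand in for the two frozen backbones (an assumption made purely for illustration), and the names `compare_spaces`, `w_image`, and `w_text` are hypothetical. Both "encoders" see the same underlying concept vectors and produce embeddings of the same shape, yet only similarities measured within one space carry any signal.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare_spaces(seed=0, dim=256, n=100):
    """Compare within-space and cross-space cosine similarity (toy setup)."""
    rng = np.random.default_rng(seed)

    # Hypothetical shared "concepts", plus a slightly perturbed copy of each.
    concepts = rng.standard_normal((n, dim))
    near_copies = concepts + 0.1 * rng.standard_normal((n, dim))

    # Stand-ins for two independently trained backbones: unrelated random
    # linear maps that happen to share the same output shape.
    w_image = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    w_text = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    img = concepts @ w_image
    img_near = near_copies @ w_image
    txt = concepts @ w_text

    # Same encoder, nearly identical inputs: similarity is preserved.
    within = np.mean([cosine(img[i], img_near[i]) for i in range(n)])
    # Different encoders, identical inputs: similarity carries no signal.
    across = np.mean([cosine(img[i], txt[i]) for i in range(n)])

    return {"within_image_space": float(within), "across_spaces": float(across)}


if __name__ == "__main__":
    print(compare_spaces())
```

In this setup the within-space value comes out close to 1 and the cross-space value close to 0, even though every cross-space pair encodes exactly the same concept. Real backbones are nonlinear and far more structured, but the same reasoning applies: matching shapes do not make the two spaces comparable unless something (e.g. a trained projection or a contrastive objective) aligns them.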