Comment by D-Machine
18 days ago
> you are correct the token representation gets abstracted away very quickly and is then identical for textual or image models.
This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.
E.g. if you have a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images, and an LLM encoder backbone pre-trained in the usual LLM way), then even if, for example, you design this architecture so the embedding vectors produced by each encoder have the same shape, to be combined via another component, e.g. some unified transformer, it will not be the case that e.g. the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
No comments yet
Contribute on Hacker News ↗