Comment by tempusalaria
9 months ago
It’s probable that there is a separate vision encoder whose output for each image tile is projected into the embedding space of the text tokens, à la CLIP/LLaVA.
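To make the guess concrete, here is a rough sketch of a LLaVA-style projector. Everything in it is an assumption for illustration: the sizes, the names (`encode_tile`, `project`, `build_input_sequence`), the random weights, and the one-vector-per-tile simplification (a real encoder would emit many patch tokens per tile). It only shows the shape of the idea, not how any particular model actually does it.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only to keep the sketch small.
D_VISION = 32   # width of the vision encoder's output features
D_MODEL = 64    # width of the language model's token embeddings
VOCAB = 100     # size of the text vocabulary

def encode_tile(tile, w_enc):
    """Stand-in for a vision encoder: flatten the tile, apply one linear map and tanh."""
    return np.tanh(tile.reshape(-1) @ w_enc)

def project(features, w_proj, b_proj):
    """Linear projector mapping vision features to the LLM embedding width."""
    return features @ w_proj + b_proj

def build_input_sequence(tiles, token_ids, w_enc, w_proj, b_proj, embed_table):
    """Prepend projected image-tile embeddings to the text token embeddings."""
    image_embeds = np.stack(
        [project(encode_tile(t, w_enc), w_proj, b_proj) for t in tiles]
    )
    text_embeds = embed_table[token_ids]  # ordinary embedding lookup
    return np.concatenate([image_embeds, text_embeds], axis=0)

tiles = rng.normal(size=(4, 8, 8))        # four 8x8 single-channel tiles
token_ids = np.array([5, 17, 42])         # three arbitrary text token ids
w_enc = rng.normal(size=(8 * 8, D_VISION))
w_proj = rng.normal(size=(D_VISION, D_MODEL))
b_proj = np.zeros(D_MODEL)
embed_table = rng.normal(size=(VOCAB, D_MODEL))

seq = build_input_sequence(tiles, token_ids, w_enc, w_proj, b_proj, embed_table)
print(seq.shape)  # (7, 64): 4 image positions followed by 3 text positions
```

The point is that after projection the language model sees image tiles as extra positions in the same sequence as the text embeddings, so nothing downstream needs to know which positions came from pixels.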