Comment by tempusalaria
2 years ago
It’s probable that there is a separate vision encoder which projects the image tiles into the embedding space of the text tokens, a la CLIP/LLaVA.
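To make that guess concrete, here is a rough sketch of the general idea, not a description of how any particular model actually does it. It assumes a toy 4x4 single-channel image, 2x2 tiles, and a hypothetical embedding width `D_MODEL`; the whole "vision encoder" is collapsed into one random linear projection, whereas something like LLaVA runs a pretrained CLIP vision encoder first and then a learned projection. All names below (`split_into_tiles`, `make_projection`, `project`) are made up for illustration.

```python
import random

D_MODEL = 8  # hypothetical width of the text token embeddings
TILE = 2     # hypothetical tile edge length in pixels


def split_into_tiles(image, tile):
    """Cut an H x W grid into non-overlapping tile x tile patches, each flattened."""
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            tiles.append([image[r + i][c + j]
                          for i in range(tile) for j in range(tile)])
    return tiles


def make_projection(in_dim, out_dim, seed=0):
    """Random out_dim x in_dim matrix standing in for learned projection weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(vec, weights):
    """Plain matrix-vector product: map one flattened tile to a D_MODEL vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


# Toy 4x4 "image" with pixel values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

tiles = split_into_tiles(image, TILE)
W = make_projection(TILE * TILE, D_MODEL)
image_tokens = [project(t, W) for t in tiles]

# Stand-in for three already-embedded text tokens (assumed, not from a real tokenizer).
text_embeddings = [[0.0] * D_MODEL for _ in range(3)]

# The language model would then see image and text vectors as one sequence.
sequence = image_tokens + text_embeddings
print(len(sequence), len(sequence[0]))
```

The point is only that each tile ends up as a vector of the same width as a text token embedding, so the language model can consume both in a single sequence.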