Comment by tempusalaria

2 years ago

There is probably a separate vision encoder that projects the image tiles into the embedding space of the text tokens, à la CLIP/LLaVA.
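To make that guess concrete, here is a rough sketch of the idea in plain Python. Everything in it is an assumption for illustration: the dimensions, the `encode_tile` stand-in for a real vision encoder, and the random `projection` matrix (which would be learned in an actual system) are all hypothetical and not taken from any real model.

```python
import random

random.seed(0)

VISION_DIM = 8  # hypothetical width of the vision encoder output
TEXT_DIM = 4    # hypothetical width of the LLM's token embeddings


def matvec(weights, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def encode_tile(tile):
    """Stand-in for a real vision encoder: maps a tile of pixel values
    to a VISION_DIM feature vector. Purely illustrative."""
    mean = sum(tile) / len(tile)
    return [mean * (i + 1) for i in range(VISION_DIM)]


# In a real system this projection would be learned; here it is random.
projection = [
    [random.uniform(-1.0, 1.0) for _ in range(VISION_DIM)]
    for _ in range(TEXT_DIM)
]


def project_tiles(tiles):
    """Encode each tile, then project it to the text embedding width."""
    return [matvec(projection, encode_tile(tile)) for tile in tiles]


def build_input_sequence(tiles, text_embeddings):
    """Prepend the projected image vectors to the text token embeddings,
    so the language model sees a single sequence of TEXT_DIM vectors."""
    return project_tiles(tiles) + text_embeddings


tiles = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # two toy "tiles"
text_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]  # three toy tokens
sequence = build_input_sequence(tiles, text_embeddings)
print(len(sequence), len(sequence[0]))
```

The point is only that, after projection, image tiles and text tokens are vectors of the same width, so the language model can attend over them as one sequence.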
