Comment by famouswaffles
3 years ago
It's possible to take a text-only model and ground it with images. Examples are:
BLIP-2 (https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
FROMAGe (https://github.com/kohjingyu/fromage)
Prismer (https://github.com/NVlabs/prismer)
PaLM-E (https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...)
Now, assuming GPT-4's vision isn't just some variant of MM-REACT (https://github.com/microsoft/MM-REACT), i.e. what you're describing, that's what's happening here.
Images can be tokenized. So what usually happens is that extra parameters are added to a frozen model, and those parameters are trained on an image-embedding-to-text-embedding task. The details vary, of course, but that's a fairly general overview of what happens.
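A minimal sketch of that recipe, assuming a PyTorch-style setup (the module name, dimensions, and the commented-out `lm`/`vision_encoder` are made up for illustration, not any particular paper's code):

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map a frozen vision encoder's embedding into a frozen LM's
    token-embedding space as a handful of pseudo-tokens. This is the
    general shape of BLIP-2/FROMAGe-style grounding, heavily simplified."""

    def __init__(self, image_dim=768, text_dim=4096, n_prefix_tokens=8):
        super().__init__()
        # The only trainable parameters: a projection from image-embedding
        # space to n_prefix_tokens embeddings in the LM's input space.
        self.proj = nn.Linear(image_dim, n_prefix_tokens * text_dim)
        self.n_prefix_tokens = n_prefix_tokens
        self.text_dim = text_dim

    def forward(self, image_emb):      # image_emb: (batch, image_dim)
        prefix = self.proj(image_emb)  # (batch, n_prefix_tokens * text_dim)
        return prefix.view(-1, self.n_prefix_tokens, self.text_dim)

# Training loop sketch (lm and vision_encoder are assumed pretrained models):
#
# for p in lm.parameters():
#     p.requires_grad = False                        # LM stays frozen
# prefix_emb = VisualPrefix()(vision_encoder(images))  # image "tokens"
# inputs = torch.cat([prefix_emb, text_token_embeddings], dim=1)
# loss = lm(inputs_embeds=inputs, labels=captions).loss  # caption LM loss

if __name__ == "__main__":
    dummy = torch.randn(2, 768)          # stand-in for vision-encoder output
    print(VisualPrefix()(dummy).shape)   # torch.Size([2, 8, 4096])
```

The key point is that only the projection receives gradients; the language model itself never changes, which is why its text-only abilities survive.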
The image-to-text task the models get trained on has its issues: it's lossy and not very robust. GPT-4's vision, on the other hand, looked incredibly robust, so they may not be doing that. I don't know.
Very interesting, thanks.
No worries. Like I said, that was just a general overview.
Strictly speaking, the model doesn't have to be frozen (though unfreezing tends to make the original model perform much worse at NLP tasks), and the task isn't necessarily just image-to-text (PaLM-E, for example, also trains to extract semantic information from objects in an image).
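As a sketch of the unfreezing variation (the stand-in modules below are hypothetical, not PaLM-E's actual code): it just means letting gradients reach the base model's weights too, typically with a smaller learning rate for the pretrained parameters:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; in practice these are a pretrained LM and the
# projection module from the sketch above.
lm = nn.Linear(4096, 4096)
prefix_module = nn.Linear(768, 4096)

# Variation: unfreeze the base LM as well (this often degrades its NLP
# performance, as noted above) and train everything jointly.
for p in lm.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW([
    {"params": lm.parameters(), "lr": 1e-5},        # small LR for pretrained weights
    {"params": prefix_module.parameters(), "lr": 1e-4},
])
```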