Comment by famouswaffles
3 years ago
It's possible to take a text-only model and ground it with images. Examples are:
BLIP-2 (https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
FROMAGe (https://github.com/kohjingyu/fromage)
Prismer (https://github.com/NVlabs/prismer)
PaLM-E (https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...)
Now, assuming GPT-4's vision isn't just some variant of MM-REACT (https://github.com/microsoft/MM-REACT), i.e. what you're describing, that's what's happening here.
Images can be tokenized. So what usually happens is that extra parameters are added to a frozen model, and those parameters are trained on an image-embedding-to-text-embedding task. The details vary, of course, but that's a fairly general overview of what happens.
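A minimal sketch of that recipe, assuming a PyTorch-style setup (the module name, dimensions, and the commented-out `lm`/`vision_encoder` are made up for illustration, not any particular paper's code):

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map a frozen vision encoder's embedding into a frozen LM's
    token-embedding space as a handful of pseudo-tokens. This is the
    general shape of BLIP-2/FROMAGe-style grounding, heavily simplified."""

    def __init__(self, image_dim=768, text_dim=4096, n_prefix_tokens=8):
        super().__init__()
        # The only trainable parameters: a projection from image-embedding
        # space to n_prefix_tokens embeddings in the LM's input space.
        self.proj = nn.Linear(image_dim, n_prefix_tokens * text_dim)
        self.n_prefix_tokens = n_prefix_tokens
        self.text_dim = text_dim

    def forward(self, image_emb):      # image_emb: (batch, image_dim)
        prefix = self.proj(image_emb)  # (batch, n_prefix_tokens * text_dim)
        return prefix.view(-1, self.n_prefix_tokens, self.text_dim)

# Training loop sketch (lm and vision_encoder are assumed pretrained models):
#
# for p in lm.parameters():
#     p.requires_grad = False                        # LM stays frozen
# prefix_emb = VisualPrefix()(vision_encoder(images))  # image "tokens"
# inputs = torch.cat([prefix_emb, text_token_embeddings], dim=1)
# loss = lm(inputs_embeds=inputs, labels=captions).loss  # caption LM loss

if __name__ == "__main__":
    dummy = torch.randn(2, 768)          # stand-in for vision-encoder output
    print(VisualPrefix()(dummy).shape)   # torch.Size([2, 8, 4096])
```

The key point is that only the projection receives gradients; the language model itself never changes, which is why its text-only abilities survive.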
The image-to-text task the models get trained on has its issues: it's lossy and not very robust. GPT-4's vision, on the other hand, looked incredibly robust, so they may not be doing that. I don't know.
Very interesting, thanks.
No worries. Like I said, that was just a general overview.
Strictly speaking, the model doesn't have to be frozen (though unfreezing tends to make the original model perform much worse at NLP tasks), and the task isn't necessarily just image-to-text (PaLM-E, for example, also trains to extract semantic information from objects in an image).
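As a sketch of the unfreezing variation (the stand-in modules below are hypothetical, not PaLM-E's actual code): it just means letting gradients reach the base model's weights too, typically with a smaller learning rate for the pretrained parameters:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; in practice these are a pretrained LM and the
# projection module from the sketch above.
lm = nn.Linear(4096, 4096)
prefix_module = nn.Linear(768, 4096)

# Variation: unfreeze the base LM as well (this often degrades its NLP
# performance, as noted above) and train everything jointly.
for p in lm.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW([
    {"params": lm.parameters(), "lr": 1e-5},        # small LR for pretrained weights
    {"params": prefix_module.parameters(), "lr": 1e-4},
])
```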