Comment by SanderNL
3 years ago
I’m waiting for GPT-4’s image API. From what I understand it’s not just an “image2text” descriptor that then “reasons” over that description, right?
It’s just grokking an image directly. Were the pixels tokenized somehow? I’m very curious what that does to a model like this.
Can somebody that actually knows anything clue me in?
It's possible to take a text-only model and ground it with images. Examples include:
BLIP-2 (https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
FROMAGe (https://github.com/kohjingyu/fromage)
Prismer (https://github.com/NVlabs/prismer)
PaLM-E (https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...)
Now, assuming GPT-4's vision isn't just some variant of MM-REACT (i.e. what you're describing: https://github.com/microsoft/MM-REACT), that's what's happening here.
Images can be tokenized. What usually happens is that extra parameters are added to a frozen language model, and those parameters are trained on an image-embedding-to-text-embedding task. The details vary, of course, but that's a fairly general overview of what happens.
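To make that concrete, here's a minimal PyTorch sketch of the adapter idea (the module names, dimensions, and the inputs_embeds call are illustrative assumptions, not taken from any of the linked repos):

    import torch
    import torch.nn as nn

    class VisionAdapter(nn.Module):
        """Hypothetical adapter: project frozen image-encoder features into a
        frozen language model's token-embedding space."""

        def __init__(self, image_encoder, language_model, img_dim, txt_dim, n_prefix=32):
            super().__init__()
            self.image_encoder = image_encoder    # e.g. a CLIP-style ViT, kept frozen
            self.language_model = language_model  # a decoder-only LM, kept frozen
            for p in self.image_encoder.parameters():
                p.requires_grad = False
            for p in self.language_model.parameters():
                p.requires_grad = False
            # The only trainable parameters: a projection from image features
            # to a fixed number of "soft" prefix tokens in the LM embedding space.
            self.proj = nn.Linear(img_dim, n_prefix * txt_dim)
            self.n_prefix, self.txt_dim = n_prefix, txt_dim

        def forward(self, images, caption_embeds):
            feats = self.image_encoder(images)  # assumed shape (B, img_dim)
            prefix = self.proj(feats).view(-1, self.n_prefix, self.txt_dim)
            # Prepend the image-derived prefix to the caption's token embeddings
            # and let the frozen LM predict the caption (standard LM loss),
            # assuming the LM accepts precomputed embeddings via inputs_embeds.
            inputs = torch.cat([prefix, caption_embeds], dim=1)
            return self.language_model(inputs_embeds=inputs)

Only the projection gets gradient updates, which is why the base model keeps its language abilities intact.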
The image-to-text task these models get trained on has its issues: it's lossy and not very robust. GPT-4, on the other hand, looked incredibly robust, so they may not be doing that. I don't know.
Very interesting, thanks.
No worries. Like I said, that was just a general overview.
Strictly speaking, the model doesn't have to be frozen (though unfreezing tends to make the original model perform much worse at NLP tasks), and the task isn't necessarily just image-to-text (PaLM-E, for example, also trains to extract semantic information about objects in an image).
GPT-4's architecture is a trade secret, but vision transformers tokenize patches of images: something like 8x8 or 32x32 pixel patches, rather than individual pixels.
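A toy version of that patch "tokenization" in PyTorch, assuming a 224x224 RGB image and a 32x32 patch size (numbers made up; GPT-4's actual patch size isn't public):

    import torch
    import torch.nn as nn

    patch, dim = 32, 768
    to_tokens = nn.Linear(3 * patch * patch, dim)  # flattened patch -> embedding

    image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
    # Cut the image into a 7x7 grid of 32x32 patches, then flatten each patch.
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
    image_tokens = to_tokens(patches)  # (1, 49, 768): one "token" per patch

So the image becomes 49 embeddings, each playing the same role as a text token's embedding.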
Multimodal text-image transformers add these tokens right alongside the text tokens, so there is both transfer learning and learned similarity between text and image tokens. As far as the model knows, they're all just tokens; it can't tell the difference between the two.
To the model, the tokens for the words blue/azure/teal and all the tokens for image patches containing blue are just tokens with a lot of similarity. It doesn't know whether the token it's being fed is text, image, or even audio or other sensory data. To a transformer, every token is just a number with associated weights, regardless of what it represents to us.
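In toy PyTorch terms (everything below is invented for illustration, not GPT-4's actual setup, and positional embeddings are omitted):

    import torch
    import torch.nn as nn

    dim = 768
    text_embed = nn.Embedding(50_000, dim)  # stand-in text vocabulary
    layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=2)

    text_ids = torch.randint(0, 50_000, (1, 16))   # 16 text tokens
    text_tokens = text_embed(text_ids)             # (1, 16, 768)
    image_tokens = torch.randn(1, 49, dim)         # 49 patch embeddings from above

    # One flat sequence: the model has no idea which tokens came from where.
    sequence = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 65, 768)
    out = model(sequence)  # attention runs over all 65 tokens the same way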
The GPT-4 vision API is actually in production and in at least two public products already. https://www.bemyeyes.com/ and https://www.microsoft.com/en-us/ai/seeing-ai
I’d be surprised if that doesn’t change the model in some qualitative way. Very cool, curious to see what’s possible. Thanks.