Comment by MacsHeadroom

3 years ago

GPT-4's architecture is a trade secret, but vision transformers tokenize patches of images: something like 8x8 or 32x32 pixel patches, rather than individual pixels.
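
Roughly, a minimal sketch of that patch tokenization in NumPy, with an illustrative 32x32 patch size and ignoring the learned projection and positional embeddings (GPT-4's actual configuration isn't public):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 32) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch vectors, one per token."""
    h, w, _ = image.shape
    patches = []
    for y in range(0, h - h % patch_size, patch_size):
        for x in range(0, w - w % patch_size, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size].reshape(-1))
    return np.stack(patches)  # shape: (num_patches, patch_size * patch_size * C)

image = np.random.rand(224, 224, 3)   # a 224x224 RGB image
tokens = patchify(image)              # 7x7 = 49 patches, each 32*32*3 = 3072 values
print(tokens.shape)                   # (49, 3072)
```

Each of those flattened patches then gets linearly projected into the model's embedding space, so one patch becomes one token.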

Multi-modal text-image transformers add these tokens right beside the text tokens, so there is both transfer learning and learned similarity between text and image tokens. As far as the model knows, they're all just tokens; it can't tell the difference between the two.
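
To make that concrete, here's a hypothetical sketch (made-up sequence lengths and embedding size, nothing GPT-4-specific): once both modalities are embedded, they're just concatenated into one flat token stream before the transformer layers.

```python
import numpy as np

d_model = 768
rng = np.random.default_rng(0)

text_embeddings = rng.normal(size=(12, d_model))    # e.g. 12 text tokens
patch_embeddings = rng.normal(size=(49, d_model))   # 49 image-patch tokens,
                                                    # already projected to d_model

# The transformer sees one sequence; nothing marks which rows were image vs. text.
sequence = np.concatenate([text_embeddings, patch_embeddings], axis=0)
print(sequence.shape)  # (61, 768)
```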

For the model, the tokens for the words blue/azure/teal and all the tokens for image patches containing blue are just tokens with a lot of similarity. It doesn't know whether the token it's being fed is text, image, or even audio or other sensory data. All tokens are just a number with associated weights to a transformer, regardless of what they represent to us.
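
Purely illustrative (the embeddings below are random stand-ins, not real model weights), "similarity" here just means the embedding vectors end up close together, e.g. by cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
blue_text = rng.normal(size=768)                      # stand-in embedding for "blue"
blue_patch = blue_text + 0.1 * rng.normal(size=768)   # a nearby blue image-patch embedding
red_text = rng.normal(size=768)                       # an unrelated token

print(cosine_similarity(blue_text, blue_patch))  # high: treated as similar
print(cosine_similarity(blue_text, red_text))    # near zero: unrelated
```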

The GPT-4 vision API is actually in production and in at least two public products already. https://www.bemyeyes.com/ and https://www.microsoft.com/en-us/ai/seeing-ai

I'd be surprised if that doesn't change something qualitative about the model. Very cool, curious to see what's possible. Thanks.