Comment by GaggiX
9 months ago
An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), which means it is very likely trained on sequential image tokens, with the images quantized by a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches that would output 13x13 tokens on a 512x512 image (unless they simply added a padding of 4 to the entire image and the patches are not overlapping).
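A minimal sketch of the token-count arithmetic behind that guess, assuming non-overlapping patches; the specific numbers (512x512 input, patch size 40, padding 4, one special token) are the commenter's speculation, not anything OpenAI has confirmed:

```python
# Token-count arithmetic for a hypothetical patch-based VQGAN tokenizer.
# All parameters below are assumptions for illustration only.

def token_grid_side(image_size: int, patch_size: int, padding: int = 0) -> int:
    """Side length of the token grid when non-overlapping patches tile the padded image."""
    padded = image_size + 2 * padding
    assert padded % patch_size == 0, "patches must tile the padded image exactly"
    return padded // patch_size

# 512x512 image, padding of 4 on each side -> 520x520, tiled by 40x40 patches
grid = token_grid_side(512, patch_size=40, padding=4)  # 13
image_tokens = grid * grid                              # 169
total_tokens = image_tokens + 1                         # +1 assumed special token -> 170

print(grid, image_tokens, total_tokens)  # 13 169 170
```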
How do we know it generates the images itself and isn't passing the text to DALL-E? That's supposedly how the current GPT-4 model does listen mode (with Whisper, but the same idea).
Go to the "Explorations of capabilities" section and look through the examples: https://openai.com/index/hello-gpt-4o/
You cannot get this level of control by prompting DALL-E. Also, GPT-4o isn't using Whisper (older GPT-4 models do).
At least ChatGPT 4o still looks like it is using DALL-E.
https://x.com/krishnanrohit/status/1755123169353236848?s=46
Two reasons: the capabilities shown are well beyond what DALL-E can do, and they've been clear that this "omni" model from the "omni team" is natively multimodal.