Comment by GaggiX
9 months ago
An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), which means it is very likely trained on sequential image tokens, with the images quantized by a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches that would output 13x13 tokens on a 512x512 image (unless they simply added a padding of 4 to the entire image and the patches are not overlapping).
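A minimal sketch of the token-count arithmetic behind that guess, assuming non-overlapping patches; the specific numbers (512x512 input, patch size 40, padding 4, one special token) are the commenter's speculation, not anything OpenAI has confirmed:

```python
# Token-count arithmetic for a hypothetical patch-based VQGAN tokenizer.
# All parameters below are assumptions for illustration only.

def token_grid_side(image_size: int, patch_size: int, padding: int = 0) -> int:
    """Side length of the token grid when non-overlapping patches tile the padded image."""
    padded = image_size + 2 * padding
    assert padded % patch_size == 0, "patches must tile the padded image exactly"
    return padded // patch_size

# 512x512 image, padding of 4 on each side -> 520x520, tiled by 40x40 patches
grid = token_grid_side(512, patch_size=40, padding=4)  # 13
image_tokens = grid * grid                              # 169
total_tokens = image_tokens + 1                         # +1 assumed special token -> 170

print(grid, image_tokens, total_tokens)  # 13 169 170
```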
How do we know it generates the images itself and isn't passing the text to DALL-E? That's supposedly how the current GPT-4 model does listen mode (with Whisper, but the same idea).
Go to the "Explorations of capabilities" section and look through the examples: https://openai.com/index/hello-gpt-4o/
You cannot get this level of control by prompting DALL-E. Also, GPT-4o isn't using Whisper (older GPT-4 models do).
At least ChatGPT 4o still looks like it is using DALL-E.
https://x.com/krishnanrohit/status/1755123169353236848?s=46
Two reasons: the capabilities shown are well beyond what DALL-E can do, and they've been clear that this "omni" model from the "omni team" is natively multimodal.