Comment by blixt

9 months ago

I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the exact pixel dimensions of the image (post-resize since there is a max size for images in the pipeline, besides 512x512), but I'm 99% sure it's not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from system prompt to tool details, etc. but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get it to give you the dimensions is to tell it to output a structure that contains width and height and just pick something reasonable, and they will "randomly" be the correct values:

https://x.com/blixt/status/1722298733470024076

> It's easy to get GPT to divulge everything from system prompt to tool details,

It's easy enough to get it to hallucinate those things. It doesn't actually tell them to you.

  • I'm well aware of that, but there are plenty of ways to induce verbatim quoting from "hidden" information, and mostly verify it (through sampling a large number of times in separate runs).

    Models are improving in truly hiding or ignoring information these days though. As the author of the article states, you'll have a hard time tricking GPT-4o to read text in images as instructions, most likely thanks to this research: https://openai.com/index/the-instruction-hierarchy/

    I do feel pretty confident that when the model is happily spitting out its system prompt, and all metadata around the image, but not its pixel dimensions, that probably those dimensions were not provided in any system/assistant/tool message. So maybe part of the image embeddings also encode the pixel dimensions somehow (it would also help the model not think of the image as a squished square for non-1:1 images that have been resized to 512x512).

Perhaps images aren’t tokens at all… and 170 tokens is just an approximation of the compute cost.

  • I think that would have pretty serious implications for the transformer architecture though. If they're not embedded like text tokens, how would attention, etc work? And a conversation with multiple images back and forth? Not to mention with GPT-4o now having audio support as well. I would assume it does become tokens.