Comment by blixt
9 months ago
I did something similar when GPT-4V came out, partly with the goal of figuring out the input format (I never got anywhere beyond "magic vectors"), but also to roughly estimate how much data you can get back out of a 512x512 image (the low-quality option).
What I found is that you can sometimes get more text out of an 85-token image than out of 85 tokens of text! That said, I think there are plenty of edge cases where the image actually loses some information, and you could argue that even if you removed every other word from the text, the model could still restore it.
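A quick way to sanity-check that comparison is to count the tokens of whatever text the model reads back out and weigh it against the flat 85-token cost of a low-detail image input. A minimal Python sketch, assuming the tiktoken library and the cl100k_base encoding; the `recovered` string is a hypothetical stand-in for the model's actual transcription:

    import tiktoken

    # OpenAI's documented flat cost for a low-detail (512x512) image input.
    LOW_DETAIL_IMAGE_TOKENS = 85

    def text_token_count(text: str, encoding_name: str = "cl100k_base") -> int:
        """Count how many tokens a piece of text costs under the given encoding."""
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))

    # Stand-in for the text the model transcribed back out of the image.
    recovered = "Whatever text the model read back out of the 512x512 image."

    n = text_token_count(recovered)
    print(f"recovered text: {n} tokens vs. flat image cost: {LOW_DETAIL_IMAGE_TOKENS}")
    if n > LOW_DETAIL_IMAGE_TOKENS:
        print("The image carried more text tokens than it cost.")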
I never went deeper on this, but I believe there's something clever to be done with the context window, given that images are relatively cheap token-wise.
The author mentions this in the article: more than 170 tokens of text can be pulled from an image.
Ah, you're right! My bad!