Comment by krackers

3 months ago

>with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements

Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming that the embedding size of the vision token and text token is the same `d` (which I think has to be the case for multimodal models), then wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence in terms of text tokens? I don't see how you could know a priori that x << y (especially by a factor of 10 as quoted in the paper).
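
For concreteness, here's the back-of-envelope version of that comparison. Every number in it apart from the 16x16 patch size is an assumption I'm making up for illustration (~4 characters per BPE token, ~6x12 px glyphs at a small font, a shared embedding dim `d`), not something from the paper:

```python
# Rough comparison: how many 16x16 vision patches vs. text tokens does one
# sentence need, and how many d-dimensional vectors does each side cost?
# All constants except the patch size are assumptions for illustration.
import math

sentence_chars = 200        # length of the sentence in characters
chars_per_text_token = 4    # rough BPE average (assumption)
patch_size = 16             # vision patch edge in pixels
char_width_px = 6           # assumed glyph width at a small font size
line_height_px = 12         # assumed line height at that font size
d = 1024                    # shared embedding dimension (assumption)

# Text side: y tokens, each one d-dimensional vector.
y = math.ceil(sentence_chars / chars_per_text_token)

# Vision side: render the sentence as a single line of pixels, then tile it
# into 16x16 patches; x patches, each also one d-dimensional vector.
text_width_px = sentence_chars * char_width_px
x = math.ceil(text_width_px / patch_size) * math.ceil(line_height_px / patch_size)

print(f"text tokens   y = {y:3d}, elements = {y * d}")
print(f"vision tokens x = {x:3d}, elements = {x * d}")
```

With those (made-up) numbers x and y land in the same ballpark, which is exactly why the claimed 10x factor has to come from something beyond "a patch covers more characters".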

That said, if I experimentally try this by shrinking this very comment down to the smallest font size I can read it at and counting how many 16x16 tokens it takes, you can fit more text than I expected in each "vision token". So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the type of redundancy it uncovers isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?)
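
If anyone wants to redo that squint test programmatically, here's a quick Pillow sketch that renders a string with PIL's default bitmap font and counts how many 16x16 patches actually contain ink. The ~6 px glyph width, 12 px line height, and wrap width are assumptions, and the default font is just a stand-in for "the smallest font size I can read":

```python
# Render text at a small bitmap font and count 16x16 patches that contain ink.
# Font metrics (~6x12 px) and the wrap width are assumptions, not measurements.
import math
from PIL import Image, ImageDraw, ImageFont

def count_ink_patches(text: str, patch: int = 16, wrap_px: int = 512) -> int:
    font = ImageFont.load_default()          # small bitmap font, roughly 6x11 px glyphs
    # Naive word wrap, assuming ~6 px per glyph.
    words, lines, line = text.split(), [], ""
    for w in words:
        trial = (line + " " + w).strip()
        if len(trial) * 6 <= wrap_px:
            line = trial
        else:
            lines.append(line)
            line = w
    lines.append(line)

    line_h = 12
    height = math.ceil(line_h * len(lines) / patch) * patch   # pad to a patch multiple
    img = Image.new("L", (wrap_px, height), color=255)        # white canvas
    draw = ImageDraw.Draw(img)
    for i, ln in enumerate(lines):
        draw.text((0, i * line_h), ln, font=font, fill=0)     # black text

    # Count 16x16 tiles that contain at least one non-white pixel.
    used = 0
    for y0 in range(0, height, patch):
        for x0 in range(0, wrap_px, patch):
            tile = img.crop((x0, y0, x0 + patch, y0 + patch))
            if tile.getextrema()[0] < 255:
                used += 1
    return used

comment_text = "Each vision token represents a 16x16 patch, but to fully cover a word ..."
print(count_ink_patches(comment_text), "patches contain text")
```

Comparing that count against a tokenizer's count for the same string gives a concrete x vs. y for any piece of text.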

You should read the 6th page of the paper (and page 5 for the architecture breakdown): they show that they compress the vision tokens with convolution to keep a strong semantic understanding while keeping the number of tokens small.

But I think it's still experimental.
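
For anyone who hasn't opened the paper, the general shape of "compress the vision tokens with convolution" looks something like the sketch below. This is not the paper's actual module, just a minimal illustration of the idea; the embedding size, kernel, and 4x reduction factor are assumptions:

```python
# Minimal sketch of convolutional token compression: lay the patch embeddings
# back out on their 2D grid and downsample with a strided conv, so fewer
# tokens reach the language model. Dimensions and the 4x factor are assumptions.
import torch
import torch.nn as nn

class ConvTokenCompressor(nn.Module):
    def __init__(self, d: int = 1024, stride: int = 2):
        super().__init__()
        # One stride-2 conv halves each grid dimension, i.e. a 4x token reduction.
        self.compress = nn.Conv2d(d, d, kernel_size=3, stride=stride, padding=1)

    def forward(self, tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        b, n, d = tokens.shape                      # (batch, grid_h * grid_w, d)
        x = tokens.transpose(1, 2).reshape(b, d, grid_h, grid_w)
        x = self.compress(x)                        # (batch, d, grid_h/2, grid_w/2)
        return x.flatten(2).transpose(1, 2)         # (batch, fewer tokens, d)

tokens = torch.randn(1, 32 * 32, 1024)              # 1024 patch tokens in
out = ConvTokenCompressor()(tokens, 32, 32)          # 256 compressed tokens out
print(tokens.shape, "->", out.shape)
```

Whether that kind of learned downsampling preserves enough of the text to hit the quoted compression ratios is exactly the empirical question.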