
Comment by psb217

3 months ago

The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.
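A back-of-the-envelope way to see the capacity gap (a rough sketch; the vocabulary size, dimension, and precision below are illustrative assumptions, not numbers from this thread):

```python
import math
import numpy as np

# Illustrative numbers, not from the comment above.
vocab_size = 50_000          # typical text-tokenizer vocabulary size
d_model = 1024               # embedding / vision-token dimension
bits_per_component = 16      # e.g. fp16 activations

# A discrete text token is one choice out of vocab_size options, so it carries
# at most log2(vocab_size) bits of information before the embedding lookup.
text_bits = math.log2(vocab_size)               # ~15.6 bits

# A continuous vision token is a d_model-dimensional real vector; a loose upper
# bound on its raw capacity is bits_per_component * d_model (far less is usable
# in practice, but still much more than one text token).
vision_bits = bits_per_component * d_model      # 16384 bits

print(f"text token:   <= {text_bits:.1f} bits")
print(f"vision token: <= {vision_bits} bits (loose upper bound)")

# The lookup table maps a discrete id to a continuous vector, but that vector
# can only be one of vocab_size points; a vision token can be any point in R^d.
embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)
text_vector = embedding_table[1234]                           # one of 50k possible vectors
vision_vector = np.random.randn(d_model).astype(np.float32)   # arbitrary point
```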

Couldn't you do something like add a bidirectional encoder after your embedding lookup table to compress your text into some smaller token-count semantic space before feeding it to your transformer blocks, to get a similar effect?

  • Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries like words, phrases, sentences, etc. So, the trick is how to decide which base tokens should be grouped into each meta token (a rough sketch of the overall setup is below the reference).

    This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. E.g., one neat paper exploring this direction is "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.

    [1] - https://arxiv.org/abs/2507.07955
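A minimal sketch of the kind of compressor discussed above: a small bidirectional encoder over the base token embeddings plus a boundary-scoring head that picks where chunks end, then mean-pooling within each chunk. All module names, sizes, and the hard top-k boundary selection are hypothetical assumptions to show the data flow; this is not the method from [1].

```python
import torch
import torch.nn as nn

class MetaTokenCompressor(nn.Module):
    """Compress a long base-token sequence into fewer meta tokens (illustrative sketch)."""

    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # bidirectional: no causal mask
        self.boundary_head = nn.Linear(d_model, 1)                          # scores "a chunk ends here"

    def forward(self, base_emb: torch.Tensor, num_meta: int) -> torch.Tensor:
        """base_emb: (batch, seq_len, d_model) -> (batch, num_meta, d_model)."""
        x = self.encoder(base_emb)
        scores = self.boundary_head(x).squeeze(-1)  # (batch, seq_len)
        metas = []
        for b in range(x.size(0)):
            # Pick the num_meta highest-scoring positions as chunk boundaries.
            # Hard top-k selection is non-differentiable and drops any tokens
            # after the last boundary; real systems use soft/straight-through
            # tricks -- this only shows the shape of the idea.
            ends = torch.topk(scores[b], k=num_meta).indices.sort().values
            start, pooled = 0, []
            for end in ends.tolist():
                pooled.append(x[b, start:end + 1].mean(dim=0))  # mean-pool each chunk
                start = end + 1
            metas.append(torch.stack(pooled))
        return torch.stack(metas)  # (batch, num_meta, d_model)

# Usage: 256 base tokens compressed to 32 meta tokens, then fed to the main model.
emb = nn.Embedding(50_000, 512)
ids = torch.randint(0, 50_000, (4, 256))
meta_tokens = MetaTokenCompressor()(emb(ids), num_meta=32)
print(meta_tokens.shape)  # torch.Size([4, 32, 512])
```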