Comment by sd9
3 months ago
> more information compression (see paper) => shorter context windows, more efficiency
It seems crazy to me that image inputs (of text) are smaller and more information dense than text - is that really true? Can somebody help my intuition?
See this thread https://news.ycombinator.com/item?id=45640720
As I understood the responses, the benefit comes from making better use of the embedding space. BPE tokenization is basically a fixed lookup table, whereas to form "image tokens" you throw each 16x16 patch through a neural net and (handwave) out comes your embedding. From that it should be fairly intuitive that image tokens have the capacity to be more information dense: the text-token embeddings can only ever be ~VOCAB_SIZE distinct points, nowhere near filling the embedding space, while a patch embedding can vary continuously with its contents. And you might hope that the network can actually make use of that extra capacity, since you're not encoding one subword at a time.
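A minimal sketch of that difference (assuming PyTorch; the dimensions are illustrative, not from the paper): the text path is a lookup into a fixed table with one row per vocabulary entry, while the image path runs a flattened 16x16 patch through a learned projection, so its embedding can land anywhere in the continuous space.

    import torch
    import torch.nn as nn

    d_model, vocab_size = 768, 50_000

    # Text path: each token id just selects one of ~vocab_size fixed rows,
    # so there are at most vocab_size distinct embedding vectors.
    token_embedding = nn.Embedding(vocab_size, d_model)
    token_ids = torch.tensor([[17, 42, 9001]])        # (batch, seq)
    text_embs = token_embedding(token_ids)            # (1, 3, 768)

    # Image path: a flattened 16x16 RGB patch (768 raw values) goes through
    # a learned linear map, so the embedding varies continuously with the
    # pixel content instead of being one of a finite set of points.
    patch_proj = nn.Linear(16 * 16 * 3, d_model)
    patches = torch.randn(1, 3, 16 * 16 * 3)          # (batch, n_patches, patch_dim)
    image_embs = patch_proj(patches)                  # (1, 3, 768)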
It must be the tokenizer. Figuring out words from an image is harder (edges, shapes, letters, words, ...), yet the internal representations end up more efficient.
I always found it strange that tokens can't just be symbols; instead there's a vocabulary of ~500k tokens, which strips low-level information out of the language (rhythm, syllables, etc.). The side effects are simple failures like miscounting the r's in "strawberry", or no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.
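To make that concrete, here's a small sketch using the tiktoken library (the exact subword split depends on the encoding; the point is only that the model sees opaque ids, not letters):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                                   # a few integer token ids
    print([enc.decode([i]) for i in ids])        # the subword pieces the model actually sees
    # Character-level facts (letter counts, syllables, rhyme) are not explicit
    # in these ids; the model has to infer them indirectly.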
I absolutely think that it can, but it depends on what meaning you associate with each pixel.