
Comment by krackers

3 months ago

See this thread https://news.ycombinator.com/item?id=45640720

As I understood the responses, the benefit comes from making better use of the embedding space. BPE tokenization is basically a fixed lookup table, whereas to form "image tokens" you throw each 16x16 patch through a neural net and (handwave) out comes your embedding. From that it should be fairly intuitive that image tokens have the capacity to be more information dense: the embedding vectors produced by current text tokenization don't even form a subspace, they're just ~$VOCAB_SIZE discrete points. And you might hope the network can somehow make use of that extra capacity, since you're no longer encoding one subword at a time. A minimal sketch of the contrast (assuming PyTorch; the dimensions and names are illustrative, not from any particular model):
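
```python
import torch
import torch.nn as nn

d_model = 768        # embedding dimension (illustrative)
vocab_size = 50_000  # BPE vocabulary size (illustrative)

# Text: a fixed lookup table -- every token maps to one of ~vocab_size
# pre-learned points in embedding space.
text_embed = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[101, 2054, 2003]])      # (batch, seq)
text_vectors = text_embed(token_ids)               # (1, 3, d_model)

# Image: each flattened 16x16 RGB patch (16*16*3 = 768 values) goes through
# a learned projection, so the resulting "token" can land anywhere in a
# continuous region of embedding space, not just at vocab_size fixed points.
patch_embed = nn.Linear(16 * 16 * 3, d_model)
patches = torch.randn(1, 3, 16 * 16 * 3)           # (batch, n_patches, patch_dim)
image_vectors = patch_embed(patches)               # (1, 3, d_model)

print(text_vectors.shape, image_vectors.shape)
```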