Comment by jerojero
3 months ago
Put very simply: because the image can be fed directly into the network, without first having to transform the text into a series of tokens as we do now.
But the tweet itself is kinda an answer to the question you're asking.
How is it materially different from using each char (or each byte) as a token?
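To make the question concrete, here is a minimal Python sketch of what "each char (or each byte) as a token" could look like. The function names `char_tokens` and `byte_tokens` are made up for illustration and don't come from the tweet or from any particular library; this only shows the text side of the comparison, not how an image-based model would ingest the same string.

```python
def char_tokens(text: str) -> list[int]:
    """Hypothetical helper: one token id per Unicode code point."""
    return [ord(ch) for ch in text]


def byte_tokens(text: str) -> list[int]:
    """Hypothetical helper: one token id per UTF-8 byte (vocabulary of 256)."""
    return list(text.encode("utf-8"))


# Escapes are used so the sample does not depend on the file's encoding.
sample = "na\u00efve caf\u00e9"

print(len(char_tokens(sample)), char_tokens(sample))
print(len(byte_tokens(sample)), byte_tokens(sample))

# Byte tokens round-trip losslessly back to the original text.
assert bytes(byte_tokens(sample)).decode("utf-8") == sample
```

Either way the text still has to be turned into a discrete sequence of ids before the network sees it, and non-ASCII characters make the byte sequence longer than the char sequence.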