Comment by antirez
3 months ago
This should be "pixels are (maybe) a better representation than the current representation of tokens", which is a very different claim. Text is surely more information-dense than an image containing the same text, so the problem is finding the best representation of text. If each word is expanded to a very large embedding and you see pixels doing better, then the problem is in the representation, not in text vs. image.
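A rough back-of-the-envelope sketch of the density point, not a measurement of any real model. Every constant here is an assumption made up for illustration: an 8x16 one-bit glyph cell for the rendered image, a 4096-wide fp16 embedding, and one token per whitespace-separated word.

```python
# Compare the raw size of one sentence in three representations.
# All constants are hypothetical, chosen only to make the arithmetic concrete.
GLYPH_W, GLYPH_H = 8, 16   # assumed monospace glyph cell, in pixels
EMBED_DIM = 4096           # assumed embedding width
BYTES_PER_FLOAT = 2        # fp16


def utf8_bytes(s: str) -> int:
    """Size of the plain text encoded as UTF-8."""
    return len(s.encode("utf-8"))


def bitmap_bytes(s: str) -> int:
    """Size of the text rendered on one line at 1 bit per pixel."""
    return len(s) * GLYPH_W * GLYPH_H // 8


def embedding_bytes(s: str) -> int:
    """Size after expanding each word to one embedding vector.

    Crude assumption: one token per whitespace-separated word.
    """
    return len(s.split()) * EMBED_DIM * BYTES_PER_FLOAT


text = "Text is surely more information dense than the image containing the same text"

for name, fn in [("utf-8", utf8_bytes), ("bitmap", bitmap_bytes), ("embeddings", embedding_bytes)]:
    print(f"{name:>10}: {fn(text)} bytes")
```

Under these assumptions the plain text is the smallest of the three and the embedded form is by far the largest, which is the sense in which the comparison is really between representations and not between text and images. Different tokenizers, embedding widths, or image resolutions would shift the ratios.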