Comment by krackers
3 months ago
Thank you, this makes sense! As [1] puts it pithily:
>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.
That seems to imply it's not necessarily something unique about images, just a byproduct of having a better conversion from "raw input -> embeddings" [2], although there is a certain elegance to handling both images and text with the same method.
[1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902
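To make the "raw input -> embeddings" point concrete, here is a minimal PyTorch sketch (my own illustration, not from the linked posts): text tokens reach the model through a discrete lookup table, so they can only land on one of |vocab| fixed points, while image patches are linearly projected from raw pixels and can occupy arbitrary points in the embedding space. The dimensions (768-d model, 16x16 patches, 50k vocabulary) are illustrative.

    import torch
    import torch.nn as nn

    d_model   = 768                  # embedding dimension (illustrative)
    vocab     = 50_000               # text vocabulary size (illustrative)
    patch     = 16                   # ViT-style patch side length
    patch_dim = 3 * patch * patch    # raw RGB pixels per patch

    # Text: discrete lookup -- each token maps to one of `vocab` fixed
    # vectors, so the usable set of embeddings is finite.
    tok_embed = nn.Embedding(vocab, d_model)
    text_ids  = torch.randint(0, vocab, (1, 12))   # (batch, seq)
    text_vecs = tok_embed(text_ids)                # (1, 12, 768)

    # Images: continuous linear projection of raw patch pixels -- the
    # resulting vectors can land anywhere in R^d_model rather than on a
    # fixed codebook, which is the "better conversion" being discussed.
    patch_proj = nn.Linear(patch_dim, d_model)
    patches    = torch.randn(1, 196, patch_dim)    # 14x14 grid of patches
    patch_vecs = patch_proj(patches)               # (1, 196, 768)

    print(text_vecs.shape, patch_vecs.shape)       # both (batch, seq, 768)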
(Just noting that https://news.ycombinator.com/item?id=45652952 and the article therein are also worth reading)