
Comment by krackers

3 months ago

Thank you, this makes sense! As [1] puts it pithily:

>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

That seems to imply it's not necessarily something unique about images, just a byproduct of having a better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance in handling both images and text with the same method.
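To make the intuition concrete, here is a minimal sketch (not from the comment; all names and sizes are hypothetical) contrasting the two "raw input -> embeddings" paths: text token ids can only index a finite embedding table, while image patches are projected continuously, so patch embeddings can occupy far more of the high-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000

# Text path: token id -> fixed row of an embedding table.
# At most `vocab_size` distinct vectors are ever possible.
embed_table = rng.standard_normal((vocab_size, d_model))
token_ids = rng.integers(0, vocab_size, size=16)
text_embs = embed_table[token_ids]

# Image path: raw patch pixels -> linear projection (ViT-style patchify).
# Any pixel content maps to a point in a continuum of possible vectors.
patch_dim = 16 * 16 * 3                        # 16x16 RGB patch, flattened
proj = rng.standard_normal((patch_dim, d_model)) / np.sqrt(patch_dim)
patches = rng.random((16, patch_dim))          # stand-in for real pixel patches
patch_embs = patches @ proj

print(text_embs.shape, patch_embs.shape)       # same shape, very different support
```

The point of the sketch is just the asymmetry: the text path is a lookup into a discrete codebook, the image path is a continuous map, which is one way to read "image-patch tokens make better use of the embedding space".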

[1] https://twitter.com/c0mbinat0r/status/1980698103234891892
[2] https://twitter.com/Kangwook_Lee/status/1980709454522744902