
Comment by krackers

3 months ago

Thank you, this makes sense! As [1] puts it pithily:

>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

That seems to imply it's not necessarily something unique about images, just a byproduct of having a better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance in handling both images and text with the same method.
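To make the intuition concrete, here is a minimal sketch (not from the comment; all names and sizes are hypothetical) contrasting the two "raw input -> embeddings" paths: text token ids can only index a finite embedding table, while image patches are projected continuously, so patch embeddings can occupy far more of the high-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000

# Text path: token id -> fixed row of an embedding table.
# At most `vocab_size` distinct vectors are ever possible.
embed_table = rng.standard_normal((vocab_size, d_model))
token_ids = rng.integers(0, vocab_size, size=16)
text_embs = embed_table[token_ids]

# Image path: raw patch pixels -> linear projection (ViT-style patchify).
# Any pixel content maps to a point in a continuum of possible vectors.
patch_dim = 16 * 16 * 3                        # 16x16 RGB patch, flattened
proj = rng.standard_normal((patch_dim, d_model)) / np.sqrt(patch_dim)
patches = rng.random((16, patch_dim))          # stand-in for real pixel patches
patch_embs = patches @ proj

print(text_embs.shape, patch_embs.shape)       # same shape, very different support
```

The point of the sketch is just the asymmetry: the text path is a lookup into a discrete codebook, the image path is a continuous map, which is one way to read "image-patch tokens make better use of the embedding space".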

[1] https://twitter.com/c0mbinat0r/status/1980698103234891892
[2] https://twitter.com/Kangwook_Lee/status/1980709454522744902