Comment by tippytippytango
2 days ago
The python zoom in seems performative. A vision model already has access to all the data, how does zooming in help it? Still very cool that it can!
2 days ago
The python zoom in seems performative. A vision model already has access to all the data, how does zooming in help it? Still very cool that it can!
Vision models are typically bad at small details. If there’s too much stuff going on at once, they can’t focus on the entire image.
Yeah, I'm a little unconvinced by that. My best guess there is that the vision input has quite a restricted resolution and "zooming in" (really, cropping to an area) lets it get more information about the region of the photo because it's not as "fuzzy". Just a hunch though.
Yeah, once it gets converted into tokens how does "zooming in" somehow increase information content?
It's cropping the original image then tokenizing it again with less downsampling, not cropping its internal representation.