Comment by tippytippytango

3 months ago

The Python zoom-in seems performative. A vision model already has access to all the data, so how does zooming in help it? Still very cool that it can!

4 comments


Legend2440 3 months ago

Vision models are typically bad at small details. If there’s too much stuff going on at once, they can’t focus on the entire image.

simonw 3 months ago

Yeah, I'm a little unconvinced by that. My best guess there is that the vision input has quite a restricted resolution and "zooming in" (really, cropping to an area) lets it get more information about the region of the photo because it's not as "fuzzy". Just a hunch though.

energy123 3 months ago

Yeah, once it gets converted into tokens how does "zooming in" somehow increase information content?

nutrientharvest 3 months ago

It's cropping the original image then tokenizing it again with less downsampling, not cropping its internal representation.
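To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the vision input caps the longer side of an image at some fixed size; the 1024 px cap, the photo dimensions, and both function names are made up for illustration, not taken from any real model.

```python
# Hypothetical cap on the longer image side before tokenizing (an assumption,
# not a documented limit of any particular model).
MAX_SIDE = 1024


def downsample_factor(width: int, height: int, max_side: int = MAX_SIDE) -> float:
    """Source pixels squeezed into one input pixel along the longer side.

    Images that already fit under the cap are not upscaled, so the factor
    never drops below 1.0.
    """
    return max(1.0, max(width, height) / max_side)


def pixels_on_detail(detail_px: float, width: int, height: int,
                     max_side: int = MAX_SIDE) -> float:
    """Input pixels left for a detail that spans detail_px source pixels."""
    return detail_px / downsample_factor(width, height, max_side)


if __name__ == "__main__":
    full_photo = (4000, 3000)  # illustrative original image size
    crop = (800, 600)          # illustrative region cut from the original
    sign_width = 100           # a small detail, measured in source pixels

    print("full photo factor:", downsample_factor(*full_photo))
    print("crop factor:", downsample_factor(*crop))
    print("detail px via full photo:", pixels_on_detail(sign_width, *full_photo))
    print("detail px via crop:", pixels_on_detail(sign_width, *crop))
```

Under those assumptions the full photo gets shrunk by roughly 4x per side before the model sees it, while the crop goes in at native resolution, so the same detail arrives with about four times as many pixels across.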