Comment by bt3
17 days ago
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes for photographs, but locating sections of text or digital graphics, such as icons in a presentation, is still very hit-or-miss.
--
[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...
Have you seen any models that perform better at this? I last looked into it a year ago, but at the time they were indeed quite bad at it across the board.