Comment by bt3

17 days ago

One major takeaway that matches my own investigation is that Gemini 2.0 still struggles materially with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes for photographs, but identifying sections of text or digital graphics, such as icons in a presentation, is still very hit-or-miss.

--

[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...

Have you seen any models that perform better at this? I last looked into this a year ago, and at the time they were indeed quite bad at it across the board.