Comment by bt3

1 year ago

One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes for photographs, but identifying sections of text or digital graphics, like icons in a presentation, is still very hit-or-miss.

[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...

maeil 1 year ago

Have you seen any models that perform better at this? I last looked into this a year ago, but at the time they were indeed quite bad at it across the board.