Comment by fngjdflmdflg

5 months ago

>Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

This is what I have found as well. From what I've read, LLMs do not work well with images for specific details because their image encoders are too lossy. (No idea if this is actually correct.) For now I guess you can use regular OCR to get bounding boxes.

4 comments

minimaxir 5 months ago

Modern multimodal encoders for LLMs are fine/not lossy since they do not resize to a small size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. An 8.5" x 11" paper would be common.

I suspect the issue is prompt engineering related.

> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

> - Values should be percentages of the image width and height (0 to 1)

LLMs have enough trouble with integers (since token-wise, integers and the text representation of integers are the same); high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers", then parse and calculate the decimals later.
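A rough sketch of the "parse and calculate the decimals later" step, in Python. The reply format (`[x_min, y_min, x_max, y_max]` in pixels, top-left origin) and the helper names `parse_pixel_boxes` and `to_fractions` are assumptions for illustration, not anything Gemini is documented to emit:

```python
import re

# Assumed reply format: one "[x_min, y_min, x_max, y_max]" group per box,
# integer pixels, top-left origin. This is a hypothetical convention.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_pixel_boxes(reply: str) -> list[tuple[int, int, int, int]]:
    """Extract every bracketed group of four integers from the model reply."""
    boxes = []
    for match in BOX_PATTERN.finditer(reply):
        x_min, y_min, x_max, y_max = (int(g) for g in match.groups())
        boxes.append((x_min, y_min, x_max, y_max))
    return boxes


def to_fractions(
    box: tuple[int, int, int, int], width: int, height: int
) -> tuple[float, float, float, float]:
    """Convert a pixel box to 0-1 fractions of the image width and height."""
    x_min, y_min, x_max, y_max = box
    return (x_min / width, y_min / height, x_max / width, y_max / height)


if __name__ == "__main__":
    # Made-up reply text for an 850 px x 1100 px page.
    reply = "Title: [85, 110, 425, 220]\nFooter: [85, 990, 765, 1045]"
    for box in parse_pixel_boxes(reply):
        print(box, "->", to_fractions(box, 850, 1100))
```

The model only ever has to produce short integers; the division happens in ordinary code, where precision is not a concern.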

fngjdflmdflg 5 months ago
Just tried this and it did not appear to work for me. Prompt:
>Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
>this input document is 1080 x 1236 px. return the bounding boxes as integers
- BoorishBears 5 months ago
  
  https://github.com/google-gemini/cookbook/blob/a916686f95f43...
  They say there's no magic prompt, but I'd start with their default, since post-training for tasks like this usually uses a specific format to improve performance.
- minimaxir 5 months ago
  
  "Might" being the operative word, particularly with models that have weaker prompt adherence. There are a few other prompt-massaging tricks beyond the scope of an HN comment; the decimal issue is just one optimization.