Comment by minimaxir

5 months ago

Modern multimodal encoders for LLMs are fine (not lossy) since they do not resize to a small size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. An 8.5" x 11" sheet of paper would be common.

I suspect the issue is prompt-engineering related.

> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

> - Values should be percentages of the image width and height (0 to 1)

LLMs have enough trouble with integers (since token-wise integers and the text representation of integers are the same), so high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers", then parse the result and calculate the decimals later.
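To make the "parse and calculate the decimals later" step concrete, here is a minimal Python sketch. It assumes the model replies with a JSON array of `[x_min, y_min, x_max, y_max]` integer pixel boxes in the top-left coordinate system; that reply format and the `normalize_boxes` helper name are my own assumptions for illustration, not something any particular model guarantees.

```python
import json


def normalize_boxes(raw_json, width_px, height_px):
    """Convert integer pixel boxes into 0-1 fractions of the image size.

    Assumes raw_json is a JSON array of [x_min, y_min, x_max, y_max]
    integer boxes (hypothetical reply format), origin at the top-left.
    """
    normalized = []
    for x_min, y_min, x_max, y_max in json.loads(raw_json):
        # Clamp to the image bounds in case the model overshoots the page.
        x_min = max(0, min(x_min, width_px))
        x_max = max(0, min(x_max, width_px))
        y_min = max(0, min(y_min, height_px))
        y_max = max(0, min(y_max, height_px))
        # Reorder each axis so min <= max even if the model swapped corners.
        x_min, x_max = sorted((x_min, x_max))
        y_min, y_max = sorted((y_min, y_max))
        normalized.append(
            (x_min / width_px, y_min / height_px,
             x_max / width_px, y_max / height_px)
        )
    return normalized


# Example with a made-up reply for an 850 px x 1100 px page.
reply = "[[85, 110, 425, 165]]"
print(normalize_boxes(reply, 850, 1100))
```

The division happens in ordinary code, so the model only ever has to emit short integers, and the clamping guards against boxes that spill slightly outside the page.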

3 comments


fngjdflmdflg 5 months ago

Just tried this and it did not appear to work for me. Prompt:

>Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

>this input document is 1080 x 1236 px. return the bounding boxes as integers

BoorishBears 5 months ago

https://github.com/google-gemini/cookbook/blob/a916686f95f43...

They say there's no magic prompt, but I'd start with their default, since post-training for tasks like this usually relies on some specific format that improves performance.

minimaxir 5 months ago

"Might" being the operative word, particularly with models that have weaker prompt adherence. There are a few other prompt-massaging tricks beyond the scope of an HN comment; the decimal issue is just one optimization.