Comment by minimaxir

17 days ago

Modern multimodal encoders for LLMs are fine here (not lossy), since they do not downscale the image to a small fixed size and can handle arbitrary dimensions, although some dimensions are obviously better represented in the training set. An 8.5" x 11" page would be a common one.

I suspect the issue is related to prompt engineering.

> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

> - Values should be percentages of the image width and height (0 to 1)

LLMs have enough trouble with integers (since, token-wise, integers and the text representation of integers are the same); high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers", then parse the response and calculate the decimals afterward.
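A minimal sketch of that "parse and calculate later" step, in Python. It assumes the model was asked to reply with a JSON array of `[x_min, y_min, x_max, y_max]` integer pixel boxes in the top-left coordinate system; that response format, the function name `normalize_boxes`, and the sample reply are all hypothetical and not something any particular model is guaranteed to produce.

```python
import json


def normalize_boxes(response_text, width, height):
    """Convert integer pixel boxes to 0-1 fractions of the image size.

    Assumes response_text is a JSON array of [x_min, y_min, x_max, y_max]
    integer boxes (an assumed format, requested via the prompt).
    """
    boxes = json.loads(response_text)
    normalized = []
    for x_a, y_a, x_b, y_b in boxes:
        # Clamp to the image bounds in case the model overshoots,
        # and sort so that min <= max even if the corners are swapped.
        x_min, x_max = sorted(max(0, min(v, width)) for v in (x_a, x_b))
        y_min, y_max = sorted(max(0, min(v, height)) for v in (y_a, y_b))
        normalized.append(
            (x_min / width, y_min / height, x_max / width, y_max / height)
        )
    return normalized


# Hypothetical model reply for an 850 x 1100 px page.
reply = "[[85, 110, 425, 165]]"
print(normalize_boxes(reply, 850, 1100))
```

The division happens in ordinary code rather than in the model's output, so the model only ever has to emit short integers.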

Just tried this and it did not appear to work for me. Prompt:

> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

> this input document is 1080 x 1236 px. return the bounding boxes as integers