
Comment by ks2048

9 days ago

I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For the applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems infeasible without bounding boxes to quickly check for errors.

qwen2.5-vl-72b-instruct seems perfectly happy outputting bounding boxes in my testing.

There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.
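In case it's useful, here's a minimal sketch of how you might ask Qwen2.5-VL for per-line text plus boxes through an OpenAI-compatible endpoint (e.g. a local vLLM server). The base_url, model name, and exact output schema are my assumptions, not something from the paper; adjust to whatever deployment you're actually running.

```python
# Sketch: request text + bounding boxes from a Qwen2.5-VL deployment
# behind an OpenAI-compatible API. Endpoint, model name, and the JSON
# schema asked for in the prompt are all assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Extract all text from this document image. "
    "Return a JSON array where each item has 'text' and 'bbox' "
    "as [x1, y1, x2, y2] pixel coordinates."
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    temperature=0,
)

# The model usually answers with a JSON block; depending on the checkpoint
# and prompt you may need to strip markdown fences before parsing.
print(resp.choices[0].message.content)
```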

If you're limited to open-source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate markdown step) with our solution at https://doctly.ai.

  • In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(

I'd guess that it wouldn't be a huge effort to fine-tune them to produce bounding boxes.

I haven't done it for OCR tasks, but I have fine-tuned other models to produce bounding boxes instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
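To make that concrete, here's a rough sketch of what a single training record for such a fine-tune could look like, as one line of a JSONL file. The field names ("image", "conversations", "bbox") and the example values are made up for illustration; any real dataset would follow whatever schema your fine-tuning framework expects.

```python
# Sketch of one JSONL record for a grounded-OCR fine-tune: an image path,
# a user prompt, and an assistant answer listing each text span with its box.
# Field names and values are hypothetical, for illustration only.
import json

record = {
    "image": "scans/invoice_0001.png",
    "conversations": [
        {"role": "user",
         "content": "Extract all text with bounding boxes as JSON."},
        {"role": "assistant",
         "content": json.dumps([
             {"text": "Invoice #4821", "bbox": [112, 64, 418, 96]},
             {"text": "Total: $1,204.00", "bbox": [110, 812, 402, 846]},
         ])},
    ],
}

# One JSON object per line is the usual JSONL layout for such datasets.
with open("ocr_bbox_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```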