
Comment by ks2048

9 days ago

I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For the applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems infeasible without bounding boxes to quickly check for errors.

qwen2.5-vl-72b-instruct seems perfectly happy outputting bounding boxes in my testing.

There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.
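In case it's useful, here's a minimal sketch of how you might ask Qwen2.5-VL for per-line text plus boxes through an OpenAI-compatible endpoint (e.g. a local vLLM server). The base_url, model name, and exact output schema are my assumptions, not something from the paper; adjust to whatever deployment you're actually running.

```python
# Sketch: request text + bounding boxes from a Qwen2.5-VL deployment
# behind an OpenAI-compatible API. Endpoint, model name, and the JSON
# schema asked for in the prompt are all assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Extract all text from this document image. "
    "Return a JSON array where each item has 'text' and 'bbox' "
    "as [x1, y1, x2, y2] pixel coordinates."
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    temperature=0,
)

# The model usually answers with a JSON block; depending on the checkpoint
# and prompt you may need to strip markdown fences before parsing.
print(resp.choices[0].message.content)
```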

If you're limited to open-source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate markdown step) with our solution at https://doctly.ai.

  • In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(

I'd guess that it wouldn't be a huge effort to fine-tune them to produce bounding boxes.

I haven't done it for OCR tasks, but I have fine-tuned other models to produce bounding boxes instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
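To make that concrete, here's a rough sketch of what a single training record for such a fine-tune could look like, as one line of a JSONL file. The field names ("image", "conversations", "bbox") and the example values are made up for illustration; any real dataset would follow whatever schema your fine-tuning framework expects.

```python
# Sketch of one JSONL record for a grounded-OCR fine-tune: an image path,
# a user prompt, and an assistant answer listing each text span with its box.
# Field names and values are hypothetical, for illustration only.
import json

record = {
    "image": "scans/invoice_0001.png",
    "conversations": [
        {"role": "user",
         "content": "Extract all text with bounding boxes as JSON."},
        {"role": "assistant",
         "content": json.dumps([
             {"text": "Invoice #4821", "bbox": [112, 64, 418, 96]},
             {"text": "Total: $1,204.00", "bbox": [110, 812, 402, 846]},
         ])},
    ],
}

# One JSON object per line is the usual JSONL layout for such datasets.
with open("ocr_bbox_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```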