Comment by jbarrow

1 year ago

> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.