Comment by jeswin
14 days ago
If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."
> We have hundreds of examples like this queued up, so let us know if you want some more!
Link to it then, let people verify.
I've pushed a lot of financial tables through Claude, and it gives remarkably high accuracy (99%+) when the text size is legible to a mid-40s person like me. GPT-4o is far less accurate.
[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...
99%+ is terrible in the OCR world. 99.8%+ on the first pass, and 99.99%+ (one error per 10k characters) at the end of the process - which includes human reviewers in the loop - is acceptable, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.
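To make those figures concrete, character-level accuracy is usually measured as 1 minus the character error rate (edit distance divided by reference length). A minimal sketch in pure Python, with made-up strings for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(reference: str, ocr_output: str) -> float:
    # Accuracy = 1 - CER, where CER = edit distance / reference length.
    return 1.0 - levenshtein(reference, ocr_output) / len(reference)

# One wrong character on a 1000-character page is 99.9% accuracy;
# "99%+" allows up to ten errors on that same page.
page = "a" * 1000
print(char_accuracy(page, "a" * 999 + "b"))
```

At 99% a dense financial table can carry several wrong digits per page, which is why OCR vendors quote the extra nines.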
Even with the best OCR, and high resolution scans, you might not get this due to:
- the quality of the original paper documents, and
- the language
I have non-English documents for which I'd love to have 99% accuracy!
Language is often solvable with better dictionaries. I have been forced to build my own dictionaries in the past, which brought error rates in line with more mainstream languages like English. If you are talking about a different script, like Cyrillic or Arabic, that is another problem.
1 reply →
Ha hi Jeswin! I was itching to reply to this post too, I wonder why…
Dave! Our sample sizes were large enough, and tables complex enough to opine on this.
I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.
Using that image and the following prompt on Gemini 2.0 Flash, "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible", outputs something that isn't bad, but it's not perfect:
The first column is offset vertically which mixes up information and is wrong.
I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:
(edit: line wrap messes it all up... still I think my version is better ;-)
I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.
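A sketch of what that second pass might gate on: cheap structural checks over the hierarchical JSON before spending another model call. The JSON shape and field names here are my own illustration, not from the thread:

```python
import json

# Hypothetical OCR output in the hierarchical shape we prompt for:
# a document node containing tables, each table a list of rows,
# each row mapping column headers to cell values.
raw = '''
{
  "document": "Q3 statement",
  "tables": [
    {"title": "Revenue", "rows": [{"Region": "EMEA", "Amount": "1,204"}]}
  ]
}
'''

def validate(doc: dict) -> list[str]:
    # Flag empty tables and rows whose keys drift from the first
    # row's headers - a common symptom of a misaligned column.
    problems = []
    for table in doc.get("tables", []):
        if not table.get("rows"):
            problems.append(f"empty table: {table.get('title')}")
            continue
        headers = set(table["rows"][0])
        for i, row in enumerate(table["rows"][1:], 2):
            if set(row) != headers:
                problems.append(f"{table['title']}: row {i} header mismatch")
    return problems

doc = json.loads(raw)
print(validate(doc))  # an empty list means no structural problems found
```

Anything flagged here can be routed to the second model (or a human) for correction, instead of re-checking every page.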
Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.