Comment by raincole
3 months ago
If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.
(I'm not being snarky. It's acceptable in some cases.)
But this was very much the case with existing OCR software as well? In fairness, I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which makes the mistakes much harder to catch.
Existing OCR doesn't skip over entire (legible) paragraphs or hallucinate entire sentences.
I usually run the image(s) through more than one converter then compare the results. They all have problems, but the parts they agree on are usually correct.
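A minimal sketch of that compare step, assuming you already have two plain-text transcriptions of the same page. It uses Python's standard `difflib` to split the text into runs where both converters agree and runs where they differ; the function name `compare_transcripts` and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def compare_transcripts(text_a: str, text_b: str):
    """Align two transcriptions word by word.

    Returns a list of (agree, words_a, words_b) tuples. Runs where
    agree is True are identical in both outputs; the other runs are
    the places a human should check against the image.
    """
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    runs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        runs.append((tag == "equal", a[i1:i2], b[j1:j2]))
    return runs


if __name__ == "__main__":
    # Hypothetical outputs from two different converters.
    out_a = "Total due: 168 USD by Friday"
    out_b = "Total due: 188 USD by Friday"
    for agree, wa, wb in compare_transcripts(out_a, out_b):
        label = "agree " if agree else "DIFFER"
        print(label, " ".join(wa), "|", " ".join(wb))
```

Agreement is only evidence, not proof: two converters can make the same mistake on the same blotchy glyph.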
This rarely happens to me when I use LLMs to transcribe PDFs.
This must be some older/smaller model.
Good libraries gave results with embedded confidence levels for each unit recognized.
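To illustrate why per-unit confidence is useful, here is a rough sketch that flags uncertain words for review. The `Word` record, the 0 to 100 scale, the threshold, and the sample values are all assumptions for the example; they are modeled loosely on the per-word confidence that classic engines such as Tesseract report, not on any specific library's API.

```python
from typing import Iterable, NamedTuple


class Word(NamedTuple):
    text: str
    conf: float  # assumed scale: 0 (no idea) to 100 (certain)


def mark_low_confidence(words: Iterable[Word], threshold: float = 60.0) -> str:
    """Join words into a line, wrapping uncertain ones in [?...?]."""
    return " ".join(
        w.text if w.conf >= threshold else f"[?{w.text}?]" for w in words
    )


if __name__ == "__main__":
    # Made-up recognition result for one line of a scanned invoice.
    line = [Word("Invoice", 96.0), Word("168", 41.0), Word("USD", 91.0)]
    print(mark_low_confidence(line))
```

With markers like these a reviewer only has to look at the flagged words instead of proofreading the whole page.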
Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
These days it does just that: it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).
Blotchy text and specific typefaces make 6's look like 8's, even to the non-discerning eye: a human would think it's an 8, then zoom in and see it's a 6.
Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.
Do any LLM OCRs give bounding boxes anyway? Per character and per block.
Gemini does, but it's not as good as Google Vision, and the format is different. Here is the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...
Also, Simon Willison wrote a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
I hope that this capability improves so I can use only the Gemini API.
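On the format difference: as far as I understand it, Gemini returns each box as `[ymin, xmin, ymax, xmax]` scaled to a 0 to 1000 range rather than as pixel coordinates, so it needs converting before you can draw it. A small sketch under that assumption (check the linked documentation for the model you use; `to_pixel_box` is a made-up helper name):

```python
def to_pixel_box(box, width, height):
    """Convert an assumed [ymin, xmin, ymax, xmax] box on a 0-1000 scale
    into (left, top, right, bottom) pixel coordinates for an image of
    the given width and height."""
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


if __name__ == "__main__":
    # Made-up box on a 2000x1000 pixel page image.
    print(to_pixel_box([100, 250, 500, 750], width=2000, height=1000))
```

Note the y-first ordering in the input; most drawing libraries expect x first, which is why the helper reorders the values.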
Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get it to do anything more detailed, such as word or character level.