Comment by raincole

4 months ago

If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.

(I'm not being snarky. It's acceptable in some cases.)

11 comments


jakewins 4 months ago

But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which makes it much harder to catch the mistakes, in fairness.

wahnfrieden 4 months ago
Existing OCR doesn't skip over entire (legible) paragraphs or hallucinate entire sentences.
- criddell 4 months ago
  
  I usually run the image(s) through more than one converter then compare the results. They all have problems, but the parts they agree on are usually correct.
- Davidzheng 4 months ago
  
  Rarely happens to me when using LLMs to transcribe PDFs.
- KoolKat23 4 months ago
  
  This must be some older/smaller model.
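
A rough sketch of criddell's suggestion above (run the image through more than one converter and trust the parts that agree). It assumes you already have two plain-text transcriptions of the same page; the function name `agreed_spans` is made up for illustration, and word-level diffing with Python's standard `difflib` is just one way to do the comparison.

```python
from difflib import SequenceMatcher


def agreed_spans(text_a: str, text_b: str):
    """Split two OCR transcriptions into word runs they agree on
    and pairs of runs where they differ."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False so frequent words are not silently ignored on long pages
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    agreed, disputed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            agreed.append(" ".join(a[i1:i2]))
        else:
            # replace / insert / delete: keep both versions for manual review
            disputed.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return agreed, disputed


agreed, disputed = agreed_spans(
    "Invoice total is 168 dollars",
    "Invoice total is 188 dollars",
)
print(agreed)    # runs both converters produced identically
print(disputed)  # places that need a human look
```

Agreement is only evidence, not proof: two converters can make the same mistake on the same blotchy glyph, so the disputed list is a lower bound on what needs checking.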
rkagerer 4 months ago

Good libraries gave results with embedded confidence levels for each unit recognized.
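
To make the point about embedded confidence levels concrete, here is a minimal sketch of how per-word confidence can be used downstream. The `Word` record, the 0-100 scale, and the threshold of 60 are assumptions for illustration, not any particular library's API; the idea is simply that low-confidence units get flagged for review instead of being passed through silently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    conf: float  # assumed scale: 0 (no idea) to 100 (certain)


def flag_low_confidence(words, threshold=60.0):
    """Join recognized words, wrapping any below the threshold in [?...?]."""
    return " ".join(
        w.text if w.conf >= threshold else f"[?{w.text}?]"
        for w in words
    )


page = [Word("Total", 96), Word("168", 41), Word("USD", 90)]
print(flag_low_confidence(page))
```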

red75prime 4 months ago

Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
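
For anyone wanting to try the same thing, a sketch of what "instructing it to mark low-confidence words" could look like. The prompt wording and the `[[word]]` marker convention are my own assumptions, not something Gemini defines, and the model call itself is left out; this only shows the prompt plus how you might pull the marked words back out of the reply.

```python
import re

# Hypothetical prompt; the [[...]] convention is an arbitrary choice.
PROMPT = (
    "Transcribe the text in this image exactly. "
    "Wrap any word you are unsure about in double square brackets, "
    "like [[word]]. Do not guess silently."
)

MARKER = re.compile(r"\[\[(.+?)\]\]")


def split_uncertain(transcript: str):
    """Return (clean_text, uncertain_words) from a marked-up transcript."""
    uncertain = MARKER.findall(transcript)
    clean = MARKER.sub(r"\1", transcript)  # drop the brackets, keep the word
    return clean, uncertain


clean, uncertain = split_uncertain("Paid [[168]] USD on [[3 May]]")
print(clean)
print(uncertain)
```

Note that these marks are self-reported by the model, so they are a hint for where to look rather than a calibrated confidence score.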

KoolKat23 4 months ago

These days it does just that: it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).

Blotchy text and certain typefaces make 6's look like 8's even to the human eye: a person would think it's an 8, then zoom in and see it's a 6.

Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.

wahnfrieden 4 months ago

Do any LLM OCRs give bounding boxes anyway? Per character and per block.

kelvinjps10 4 months ago

Gemini does, but it's not as good as Google Vision, and the format is different. Here is the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...
Also, Simon Willison made a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
I hope this capability improves so I can use only the Gemini API.
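
On the "format is different" point: as far as I understand it, Gemini returns boxes as `[ymin, xmin, ymax, xmax]` scaled to a 0-1000 range rather than in pixels, so you have to rescale before drawing or cropping. A small sketch under that assumption (check the linked docs for the current format; `to_pixels` is just an illustrative helper name):

```python
def to_pixels(box, width, height):
    """Convert an assumed [ymin, xmin, ymax, xmax] box on a 0-1000 scale
    into (left, top, right, bottom) pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    left = round(xmin * width / 1000)
    top = round(ymin * height / 1000)
    right = round(xmax * width / 1000)
    bottom = round(ymax * height / 1000)
    return left, top, right, bottom


# A box covering the middle half of a 2000x1000 px scan, horizontally
print(to_pixels([100, 250, 300, 750], width=2000, height=1000))
```

The (left, top, right, bottom) order is what most imaging libraries expect for a crop rectangle, which is why the y-first order gets swapped here.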
dajonker 4 months ago

Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get it to go more detailed, such as word or character level.