Comment by Onawa

1 year ago

I was trying to figure out this exact same issue. OCR on a PDF worked great, up until a certain point when it just started hallucinating like crazy. I was working on a whole pipeline to feed in a PDF one page at a time to try and get around this issue. Otherwise, the OCR works absolutely fantastic compared to all the other tools I've been trying lately. These include OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.

I've also seen some people recommend PaddleOCR, but I find their documentation lacking and I haven't gotten it working yet, so I can't evaluate it.
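
For anyone curious, the page-at-a-time idea mentioned above could be sketched roughly like this. It is only a sketch under assumptions, not a finished pipeline: `ocr_fn` is a hypothetical stand-in for whatever OCR call you actually use, the function names are made up for illustration, and the splitter assumes the third-party pypdf package is installed.

```python
import io


def split_pdf_pages(path):
    """Yield each page of a PDF as its own single-page PDF (bytes).

    Assumes the third-party pypdf package; imported lazily so the rest
    of this sketch works without it.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    for page in reader.pages:
        writer = PdfWriter()
        writer.add_page(page)
        buf = io.BytesIO()
        writer.write(buf)
        yield buf.getvalue()


def ocr_per_page(pages, ocr_fn):
    """Run ocr_fn on every page independently and collect the results.

    ocr_fn is a hypothetical callable: bytes in, text out. Each call only
    ever sees one page, so a bad result on one page cannot spill over
    into the pages that follow.
    """
    results = []
    for number, page_bytes in enumerate(pages, start=1):
        try:
            results.append({"page": number, "text": ocr_fn(page_bytes), "error": None})
        except Exception as exc:
            # Record the failure and keep going with the next page.
            results.append({"page": number, "text": None, "error": str(exc)})
    return results
```

Usage would be something like `ocr_per_page(split_pdf_pages("scan.pdf"), my_ocr)`, where `scan.pdf` and `my_ocr` are placeholders. Keeping the per-page results separate also makes it easy to re-run just the pages that came back looking wrong.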

4 comments

raybb 1 year ago

Simon Willison recently had a thread going through some of the options here: https://x.com/simonw/status/1797526667797442773

Onawa 1 year ago

Funny enough, Simon Willison is the OP of this comment thread lol.

mercer 1 year ago

But doctor, I AM Simon Wilson!

infecto 1 year ago

For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly, but the accuracy is much higher than what you'll get from the OpenAI API.