Comment by Onawa
9 months ago
I was trying to figure out this exact same issue. OCR on a PDF worked great, up until a certain point when it just started hallucinating like crazy. I was working on a whole pipeline to just feed in a PDF one page at a time to try and get around this issue. Otherwise, the OCR works absolutely fantastic compared to all other other tools I've been trying lately. These include OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.
I've also seen some people recommend Paddle OCR, but I find their documentation to be lacking and I haven't got that one working yet to evaluate.
Simon wilson recently had a thread going through some of the options here https://x.com/simonw/status/1797526667797442773
Funny enough, Simon Willison is the op of this comment thread lol.
But doctor, I AM Simon Wilson!
For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly but the accuracy is much higher than what you will find using an openai API.