Comment by jll29
15 days ago
In case any scientist actually working on adaptive OCR is reading this: I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality. Compute demands are not an issue; I've got an army of A100s available.
I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.
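For readers unfamiliar with the term, here is a minimal sketch of one simple kind of OCR post-correction: snapping each token to the closest word in a lexicon. This is a generic illustration, not the 1994 method mentioned above. The names `correct_token`, `correct_line` and the tiny `LEXICON` are hypothetical, and a real German newspaper archive would need a far larger word list plus handling for punctuation, hyphenation and proper names.

```python
import difflib

# Hypothetical toy lexicon; a real one would hold many thousands of word forms.
LEXICON = ["zeitung", "regierung", "berlin", "die", "der", "in"]


def correct_token(token, lexicon, cutoff=0.8):
    """Return the closest lexicon word to `token`, or `token` unchanged.

    Matching is case-insensitive; an initial capital is restored afterwards,
    since German nouns are capitalized.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return token  # already a known word, leave it alone
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    if not matches:
        return token  # nothing similar enough, keep the OCR output as-is
    best = matches[0]
    return best.capitalize() if token[:1].isupper() else best


def correct_line(line, lexicon):
    """Apply `correct_token` to every whitespace-separated token of a line."""
    return " ".join(correct_token(tok, lexicon) for tok in line.split())


if __name__ == "__main__":
    print(correct_line("die Zeitnng in Berlin", LEXICON))
```

The `cutoff` value is an assumption to tune: set it too low and rare but correct words get "fixed" into common ones, which is the usual failure mode of lexicon-only correction.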
I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.
(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
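Counting recovered fields is one way to compare engines. Where a hand-checked transcription of a few sample pages exists, character error rate (CER) is another common measure: edit distance between the OCR output and the reference, divided by the reference length. A minimal sketch, assuming plain strings as input; the function names `levenshtein` and `cer` are illustrative and do not come from any of the tools mentioned here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one wrong character in a seven-character word.
    print(cer("Zeitung", "Zeitnng"))
```

Running each candidate engine over the same sample pages and comparing CER gives a number to rank them by, though it says nothing about layout or reading-order errors.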
When I was doing OCR for some screenshots last year I managed to get it done with Tesseract, but just barely. When looking for alternatives later on I found something called Surya on GitHub, which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?
Surya is on par with cloud vision offerings.
would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, i’d be very curious to give these a run! i’m also happy to give advice on algos/models to fine-tune for this task
Are you targeting business or consumers?
I cannot find the pricing page.
our current customers are both enterprises and individuals.
pricing page is here https://www.runpulse.com/pricing-studio-pulse
Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)
I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.
If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?
I would love to get access to that archive!