Comment by jll29

5 months ago

In case any scientist actually working on adaptive OCR is reading this, I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality, compute demands are not an issue, I've got an army of A100s available.

I played with OCR post-correction algorithms an invented on method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.

14 comments

jll29

zeograd 5 months ago

I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decypher them)

moffkalast 5 months ago

When I was doing OCR for some screenshots last year I managed to get it done with tesseract, but just barely. When looking for alternatives later on I found something called Surya on github which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?

ianhawes 5 months ago

Surya is on par with cloud vision offerings.

ritvikpandey21 5 months ago

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

sumedh 5 months ago
Are you targeting business or consumers?
I cannot find the pricing page.
- sidmanchkanti21 5 months ago
  
  our current customers are both enterprises and individuals.
  pricing page is here https://www.runpulse.com/pricing-studio-pulse
  
  4 replies →

patcon 5 months ago

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

ahoka 5 months ago

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

pbhjpbhj 5 months ago

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

mdbmdb 5 months ago

I would love to get access to that archive!