Comment by jaffa2

5 months ago

OCR is well and good; I thought it was mostly solved with Tesseract. What does this bring? But what I'm looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings available, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image, compressing them separately, and then binding them back together into a compatible PDF.

10 comments

joecool1029 5 months ago

Cheap network-locked iPhone SE2s on eBay seem to be a cost-effective way to get good accuracy: https://findthatmeme.com/blog/2023/01/08/image-stacks-and-ip...

jjice 5 months ago

Very interesting article. I'd be interested to know if an M-series Mac Mini (this article was early 2023, so there should've been M1 and M2) would also have filled this role just fine.
> My preliminary speed tests were fairly slow on my MacBook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU).
I don't know a lot about the specifics of where (hardware-wise) this gets run, but I'd assume any semi-modern Mac would also have accelerated compute for this kind of thing. Running it on a Mac Mini would ease my worries about battery and heat issues. I would've guessed that they'd scale better as well, but I have no idea if that's actually the case. Also, you'd be able to run the server as a service for automatic restarts and such.
All that said, a rack of iPhones is pretty fun.

sandreas 5 months ago

You could try OCRmyPDF (https://github.com/ocrmypdf/OCRmyPDF)

jaffa2 5 months ago

Thanks for the message. I'm talking about MRC (https://en.m.wikipedia.org/wiki/Mixed_raster_content), not just an invisible text layer over an image.
You can achieve colour PDFs smaller than Group 4 binary compression of the same images, and 10x smaller than a JPEG-compressed PDF.
I handle many scanned documents, so my source data is typically a 300 ppi image of a book, document, newspaper, etc.
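
To make the layer idea concrete, here is a toy sketch of an MRC-style split in plain Python. It is only an illustration under heavy assumptions: the page is a grayscale grid of 0-255 values, the mask comes from a fixed threshold, and each of the foreground and background layers is reduced to a single average value. The function names (split_mrc_layers, recombine) are made up for this example; real tools use proper segmentation and store the layers as separately compressed images.

```python
def split_mrc_layers(gray, threshold=128):
    """Split a grayscale page (rows of 0-255 ints) into a binary mask
    plus one flat foreground value and one flat background value.

    Assumption: dark pixels (below threshold) are text/line art.
    """
    # 1 marks a foreground (text) pixel, 0 marks background.
    mask = [[1 if px < threshold else 0 for px in row] for row in gray]

    fg_vals = [px for row, mrow in zip(gray, mask)
               for px, m in zip(row, mrow) if m]
    bg_vals = [px for row, mrow in zip(gray, mask)
               for px, m in zip(row, mrow) if not m]

    # Flat layers: a crude stand-in for the low-resolution, lossy
    # foreground and background images used in real MRC.
    fg = sum(fg_vals) // len(fg_vals) if fg_vals else 0
    bg = sum(bg_vals) // len(bg_vals) if bg_vals else 255
    return mask, fg, bg


def recombine(mask, fg, bg):
    """Rebuild the page: the mask selects foreground or background per pixel."""
    return [[fg if m else bg for m in row] for row in mask]


page = [[250, 20, 250],
        [250, 30, 240]]
mask, fg, bg = split_mrc_layers(page)
restored = recombine(mask, fg, bg)
```

The point of the split is that the mask can go through a bilevel codec at full resolution while the colour layers are stored small and lossy. The hard part described upthread, getting a good mask from a noisy scan, is exactly what the fixed threshold here glosses over.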

kergonath 5 months ago

> OCR is well and good; I thought it was mostly solved with Tesseract. What does this bring?

Tesseract is nice, but not good enough that there is no opportunity for another, better solution.

aidenn0 5 months ago

> OCR is well and good; I thought it was mostly solved with Tesseract. What does this bring?

This is specifically for historic documents that Tesseract will handle poorly. It also provides a good interface for retraining models on a specific document set, which will help for documents that are different from the training set.

aidenn0 5 months ago

The Internet Archive generates MRC PDFs and has open-sourced its tooling: https://github.com/internetarchive/archive-pdf-tools

jaffa2 4 months ago

Thank you

fny 5 months ago

Run Tesseract on a screenshot and you'll be underwhelmed.

danpla 5 months ago

With proper image pre-processing, Tesseract can recognize even tiny text (5-7 px high).
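
The comment above does not say which pre-processing steps are meant. Two commonly suggested ones are upscaling small text and binarizing the image, so the sketch below assumes those. It is plain Python on a grayscale grid of 0-255 values, with made-up helper names (upscale, binarize), nearest-neighbour scaling, and a fixed threshold; a real pipeline would more likely use an image library with better interpolation and adaptive thresholding before handing the result to Tesseract.

```python
def upscale(gray, factor):
    """Nearest-neighbour upscale: repeat every pixel `factor` times
    horizontally and every row `factor` times vertically."""
    out = []
    for row in gray:
        wide = [px for px in row for _ in range(factor)]
        # Copy the widened row so the output rows are independent lists.
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(gray, threshold=128):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


tiny = [[10, 200]]                 # stand-in for a crop of tiny text
prepared = binarize(upscale(tiny, 2))
```

The scale factor and the threshold are the knobs to tune here; both values in the example are arbitrary.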