Comment by jaffa2
9 days ago
OCR is well and good; I thought it was mostly solved with Tesseract, so what does this bring? What I'm looking for, though, is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings available, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image, compressing them separately, and then binding them back together into a compatible PDF.
Cheap network-locked iPhone SE 2s on eBay seem to be a cost-effective way to get good accuracy: https://findthatmeme.com/blog/2023/01/08/image-stacks-and-ip...
Very interesting article. I'd be interested to know if an M-series Mac Mini (this article was early 2023, so there should've been M1 and M2 models) would have also filled this role just fine.
> My preliminary speed tests were fairly slow on my MacBook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU).
I don't know a lot about the specifics of where (hardware-wise) this gets run, but I'd assume any semi-modern Mac would also have accelerated compute for this kind of thing. Running it on a Mac Mini would ease my worries about battery and heat issues. I would've guessed that Mac Minis would scale better as well, but I have no idea if that's actually the case. Also, you'd be able to run the server as a service for automatic restarts and such.
All that said, a rack of iPhones is pretty fun.
You could try OCRmyPDF (https://github.com/ocrmypdf/OCRmyPDF)
Thanks for the message. I'm talking about MRC (https://en.m.wikipedia.org/wiki/Mixed_raster_content), not just an invisible text layer over an image.
You can achieve colour PDFs smaller than Group 4 binary compression of the same images, and 10x smaller than a JPEG-compressed PDF.
I handle many scanned documents, so my source data is typically a 300 ppi image of a book, document, newspaper, etc.
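To make the layer idea concrete, here is a minimal sketch of an MRC-style split, assuming a grayscale page stored as a list of rows and a fixed threshold. The function names (`split_mrc_layers`, `recombine`), the threshold of 128 and the 2x2 block size are all made up for illustration. A real encoder would pick the mask far more carefully and would store each layer with a codec suited to it (typically a bilevel codec for the mask and a lossy one for the colour layers), none of which is shown here.

```python
def split_mrc_layers(gray, threshold=128, block=2):
    """Split a grayscale page into a full-resolution mask plus
    downsampled foreground and background layers (illustrative only)."""
    height, width = len(gray), len(gray[0])
    # Mask: 1 where a pixel is dark enough to count as text or line art.
    mask = [[1 if px < threshold else 0 for px in row] for row in gray]

    def average(pixels, fallback):
        # Blocks with no pixels of a given kind get a neutral fallback value.
        return sum(pixels) // len(pixels) if pixels else fallback

    foreground, background = [], []
    for by in range(0, height, block):
        fg_row, bg_row = [], []
        for bx in range(0, width, block):
            fg_px, bg_px = [], []
            for y in range(by, min(by + block, height)):
                for x in range(bx, min(bx + block, width)):
                    (fg_px if mask[y][x] else bg_px).append(gray[y][x])
            fg_row.append(average(fg_px, 0))      # ink colour for this block
            bg_row.append(average(bg_px, 255))    # paper colour for this block
        foreground.append(fg_row)
        background.append(bg_row)
    return mask, foreground, background


def recombine(mask, foreground, background, block=2):
    """Rebuild the page: the mask selects ink or paper for every pixel."""
    return [
        [
            foreground[y // block][x // block] if mask[y][x]
            else background[y // block][x // block]
            for x in range(len(mask[0]))
        ]
        for y in range(len(mask))
    ]


page = [
    [230, 230, 20, 230],
    [230, 20, 20, 230],
    [230, 230, 230, 230],
    [20, 20, 20, 20],
]
mask, fg, bg = split_mrc_layers(page)
restored = recombine(mask, fg, bg)
```

The point of the split is that the mask keeps the sharp edges at full resolution while the two colour layers can be stored at a much lower resolution, which is where the size savings would come from.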
> Ocr is well and good, i thought it was mostly solved with tesseract what does this bring?
Tesseract is nice, but not good enough that there is no opportunity for another, better solution.
> Ocr is well and good, i thought it was mostly solved with tesseract what does this bring?
This is specifically for historic documents that Tesseract handles poorly. It also provides a good interface for retraining models on a specific document set, which helps with documents that differ from the training set.
The Internet Archive generates MRC PDFs and has open-sourced its tooling: https://github.com/internetarchive/archive-pdf-tools
Thank you
Run Tesseract on a screenshot and you'll be underwhelmed.
With proper image pre-processing, Tesseract can recognize even tiny text (5-7 px high).
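The comment above does not say which pre-processing steps it means. As a rough sketch, two commonly suggested ones are enlarging the image and binarizing it before handing it to the OCR engine. The code below assumes a grayscale image stored as a list of rows and uses nearest-neighbour upscaling with Otsu's threshold; the helper names are hypothetical, and in practice an image library with a smoother resampling filter would be used instead.

```python
def upscale_nearest(gray, factor):
    """Enlarge an image by repeating every pixel `factor` times in x and y."""
    out = []
    for row in gray:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def otsu_threshold(gray):
    """Return the threshold t that maximises between-class variance
    (Otsu's method). Pixels <= t form the dark class."""
    hist = [0] * 256
    for row in gray:
        for px in row:
            hist[px] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray, t):
    """Map pixels <= t to black (0) and everything else to white (255)."""
    return [[0 if px <= t else 255 for px in row] for row in gray]


small = [
    [200, 200, 40],
    [200, 40, 200],
]
big = upscale_nearest(small, 2)
clean = binarize(big, otsu_threshold(big))
```

Whether these particular steps are enough for 5-7 px text is not something this sketch demonstrates; it only shows the shape of the pipeline.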