Comment by ComputerGuru

9 months ago

We desperately need a modern open-source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs for this purpose: aside from being the wrong tool and far too overpowered for the job, they are prone to hallucinations and have insanely expensive training and inference costs. Yet the “best” non-LLM solution is so bad it can’t even correctly OCR monospaced, hi-res scans of ASCII text with sufficient accuracy.

One good self-hosted OCR option is PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR

It beats everything else and is truly international and multilingual, including Chinese (as it is made in China).
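
For anyone who wants to try it, here is a minimal sketch of the basic Python usage (based on the project README; argument names vary a bit between PaddleOCR releases):

    # pip install paddlepaddle paddleocr
    from paddleocr import PaddleOCR

    # Weights download automatically on first run; lang selects the
    # recognition model (e.g. "en", "ch", "fr").
    ocr = PaddleOCR(use_angle_cls=True, lang="en")

    result = ocr.ocr("page.png", cls=True)
    for box, (text, confidence) in result[0]:  # one entry per detected text line
        print(f"{confidence:.2f}\t{text}")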

  • It is insanely fast compared to alternatives and has really high accuracy even on new tasks without any training.

    Their PaddleLayout models are also miles ahead of LayoutParser or TableTransformer in both inference speed and output quality (a sketch of the layout pipeline follows this thread).

  • Why is it “self-hosted” and not “library + desktop/CLI app”? “Self-hosted” implies it needs a full web stack and RDBMS backend.

    • It was just to show that you can run it locally, as opposed to the “cloud APIs” referred to in the thread, but you are right, the more accurate term is “local”.

  • Holy Crap! You were right about PaddleOCR. My personal benchmark for OCR tools is to submit several random pages from the first edition Moody's Manual for Railroads.

    https://imgur.com/r2RsJeH

    The reason I use it is to test whether it's just analyzing letter-by-letter (even if they claim it does more) or if it's actually scanning the letter/word in its context. If it's letter-by-letter, I get hilariously awful results.

    Sure, it got things wrong. But it also figured out some things even I couldn't decipher.
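
On the PaddleLayout point mentioned above: the layout and table pipeline ships in the same package as PP-Structure. A rough sketch, assuming the PaddleOCR 2.x API (key names may vary by version):

    # pip install paddlepaddle paddleocr opencv-python
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)  # layout detection + table recognition

    img = cv2.imread("page.png")
    for region in engine(img):
        # Each region is a dict; "type" is e.g. "title", "text", "table", "figure"
        print(region["type"], region["bbox"])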

There are certainly smaller and even better models for OCR.

But the whole “point” of LLMs (forget it, it’s not AGI) is that you don’t need to build lots of specialized models and cursed pipelines anymore to solve a definitely-in-reach-without-LLMs problem that your farmer neighbor wants to pay $500 for.

Before LLMs it wasn’t going to get done, because it takes more than $500 worth of engineering hours. Now we just brute-force it. Sure, it takes more compute, but we get it done!

I guess your OCR dream is covered by this.

  • > There's certainly smaller and even better models for OCR

    Could you please list some? I am developing a tool that relies on OCR, and everything I've found refers to Tesseract as the best choice.
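
    For context, the Tesseract baseline that most tools wrap is reachable from Python via pytesseract. A minimal sketch (assumes the tesseract binary is installed and on PATH):

        # pip install pytesseract pillow
        from PIL import Image
        import pytesseract

        # --psm 6 tells Tesseract to assume a single uniform block of text,
        # which often helps on clean scans.
        print(pytesseract.image_to_string(Image.open("scan.png"), config="--psm 6"))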

A good open source model for handwriting recognition is sorely missing as well.

  • The United States Postal Service probably has the best in the world, though its training probably restricts it to a subset of possible inputs. I wonder if it would be possible to get a senator or congressman to push for open sourcing it.

    • I believe the USPS system makes extensive use of knowledge of possible valid addresses so you're probably right that it wouldn't be generally applicable. Their _dataset_ must be glorious (and extremely confidential) though.

  • It's often missing in humans, too, depending on how bad the particular handwritten word is.

Hmmm, I haven't tried it, but does Apple's OCR API do better here? I.e., is it possible to do this?

  • The API: https://developer.apple.com/documentation/vision/recognizing...

    In my experience it works remarkably well for features like scanning documents in Notes and copying or translating text embedded in images in Safari.

    It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms.
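
    If you would rather not depend on the apple-ocr wrapper, you can call the Vision framework directly from Python via PyObjC. A rough sketch (macOS only; method names follow PyObjC's underscore convention, and I haven't tested this beyond the basics):

        # pip install pyobjc
        import Vision
        from Foundation import NSURL

        def recognize_text(path):
            request = Vision.VNRecognizeTextRequest.alloc().init()
            request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
            handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
                NSURL.fileURLWithPath_(path), None
            )
            success, error = handler.performRequests_error_([request], None)
            if not success:
                raise RuntimeError(error)
            # Each observation's top candidate is the most likely transcription.
            return [obs.topCandidates_(1)[0].string() for obs in request.results()]

        print("\n".join(recognize_text("scan.png")))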

    • I also wrote a Swift CLI that wraps over the Vision framework: https://github.com/nexuist/seev

      Text extraction is supported (including the ability to specify custom words not found in the dictionary), but there are also utilities for face detection, classification, etc.

Has anyone tried Kosmos [0]? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.

[0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5

  • Okay, I got Kosmos-2.5 running - here's a mini review:

    It's _extremely_ slow, about 30 seconds/page on an A10G. Maybe there's room to improve that, I don't know, but for now that's a problem.

    The actual character recognition is superb, probably comparable to the big cloud offerings. On a sample of pages of moderately challenging typed text it was literally flawless aside from non-ASCII characters.

    It can do _neat_ handwriting with reasonable accuracy. This surprised me since there doesn't seem to be any handwriting in the training data (for Pix2Struct either). However, it will sometimes just skip handwriting.

    The structured (markdown) output is sometimes impressive, occasionally a mess. I noticed two weaknesses in particular: it often starts a table as an HTML table and then switches to markdown, and it struggles to distinguish multi-column layouts unless they're pure book-like paragraphs or a table with clear gridlines. This is probably a result of sane/straightforward layouts from READMEs and scientific papers being most represented in the training data (the industry I'm in produces lots of documents with layouts that are wild even to a human).

    One other thing: as a generative model it can and will go off the rails. One document I gave it with a lot of messy handwriting produced the typed header and then just 1500 lines of greater-than symbols. To be fair I couldn't read it either. While I didn't see it produce any valid-looking but "hallucinated" output, that's a possibility too.
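
    For reference, invoking it through Hugging Face Transformers looks roughly like this (adapted from the model card, so treat it as a sketch; it needs a recent transformers release, and the prompt token selects the task):

        # pip install transformers torch pillow
        import torch
        from PIL import Image
        from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

        repo = "microsoft/kosmos-2.5"
        model = Kosmos2_5ForConditionalGeneration.from_pretrained(
            repo, device_map="cuda", torch_dtype=torch.bfloat16
        )
        processor = AutoProcessor.from_pretrained(repo)

        image = Image.open("page.png")
        prompt = "<md>"  # or "<ocr>" for plain text plus bounding boxes
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        inputs.pop("height"), inputs.pop("width")  # only needed to rescale boxes
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(torch.bfloat16)

        generated = model.generate(**inputs, max_new_tokens=1024)
        print(processor.batch_decode(generated, skip_special_tokens=True)[0])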

  • It works really well for captioning. The few attempts I made at OCR failed miserably on CCTV images (camera label at top and datetime stamp on bottom).

Fully agree.

Improving OCR would require innovation within CV, separate from transformer architectures, and frankly I don't expect much new work to happen here.