Comment by ComputerGuru

9 months ago

We desperately need a modern open-source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs for this purpose: aside from being the wrong tool and far too overpowered for the job, they are prone to hallucinations and have insanely expensive training and inference costs. Yet the “best” non-LLM solution is so bad it can’t even correctly OCR monospaced, hi-res scans of ASCII text with sufficient accuracy.

One good self-hosted OCR option is PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR

It beats everything else and is truly international and multilingual, including Chinese (as it is made in China).
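
For anyone who wants to try it, here is a minimal sketch of the basic Python usage (based on the project README; argument names vary a bit between PaddleOCR releases):

    # pip install paddlepaddle paddleocr
    from paddleocr import PaddleOCR

    # Weights download automatically on first run; lang selects the
    # recognition model (e.g. "en", "ch", "fr").
    ocr = PaddleOCR(use_angle_cls=True, lang="en")

    result = ocr.ocr("page.png", cls=True)
    for box, (text, confidence) in result[0]:  # one entry per detected text line
        print(f"{confidence:.2f}\t{text}")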

  • It is insanely fast compared to alternatives and has really high accuracy even on new tasks without any training.

    Their PaddleLayout models are also miles ahead of LayoutParser or TableTransformer in both inference speed and output quality (a sketch of the layout pipeline follows this thread).

  • Why is it “self-hosted” and not “library + desktop/CLI app”? “Self-hosted” implies it needs a full web stack and RDBMS backend.

    • It was just to show that you can run it locally, as opposed to the “cloud APIs” referred to in the thread, but you are right, the more accurate term is “local”.

  • Holy Crap! You were right about PaddleOCR. My personal benchmark for OCR tools is to submit several random pages from the first edition Moody's Manual for Railroads.

    https://imgur.com/r2RsJeH

    The reason I use it is to test whether it's just analyzing letter-by-letter (even if they claim it does more) or if it's actually scanning the letter/word in its context. If it's letter-by-letter, I get hilariously awful results.

    Sure, it got things wrong. But it also figured out some things even I couldn't decipher.
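
On the PaddleLayout point mentioned above: the layout and table pipeline ships in the same package as PP-Structure. A rough sketch, assuming the PaddleOCR 2.x API (key names may vary by version):

    # pip install paddlepaddle paddleocr opencv-python
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)  # layout detection + table recognition

    img = cv2.imread("page.png")
    for region in engine(img):
        # Each region is a dict; "type" is e.g. "title", "text", "table", "figure"
        print(region["type"], region["bbox"])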

There are certainly smaller and even better models for OCR.

But the whole “point” of LLMs (forget it, it’s not AGI) is that you don’t need to build lots of specialized models and cursed pipelines anymore to solve a definitely-in-reach-without-LLMs problem that your farmer neighbor wants to pay $500 for.

Before LLMs it wasn’t going to get done, because it takes more than $500 worth of engineering hours. Now we just brute-force it. Sure, it takes more compute, but we get it done!

I guess your OCR dream is covered by this.

  • > There's certainly smaller and even better models for OCR

    Could you please list some? I am developing a tool that relies on OCR, and everything I've found refers to Tesseract as the best choice.
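
    For context, the Tesseract baseline that most tools wrap is reachable from Python via pytesseract. A minimal sketch (assumes the tesseract binary is installed and on PATH):

        # pip install pytesseract pillow
        from PIL import Image
        import pytesseract

        # --psm 6 tells Tesseract to assume a single uniform block of text,
        # which often helps on clean scans.
        print(pytesseract.image_to_string(Image.open("scan.png"), config="--psm 6"))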

A good open source model for handwriting recognition is sorely missing as well.

  • The United States Postal Service probably has the best in the world, though its training probably restricts it to a subset of possible inputs. I wonder if it would be possible to get a senator or congressman to push for open sourcing it.

    • I believe the USPS system makes extensive use of knowledge of possible valid addresses so you're probably right that it wouldn't be generally applicable. Their _dataset_ must be glorious (and extremely confidential) though.

  • It's often missing in humans, too, depending on how bad the particular handwritten word is.

Hmmm, I haven't tried it, but does Apple's OCR API do better here? I.e., is it possible to do this?

  • The API: https://developer.apple.com/documentation/vision/recognizing...

    In my experience it works remarkably well for features like scanning documents in Notes and copying or translating text embedded in images in Safari.

    It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms.
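
    If you would rather not depend on the apple-ocr wrapper, you can call the Vision framework directly from Python via PyObjC. A rough sketch (macOS only; method names follow PyObjC's underscore convention, and I haven't tested this beyond the basics):

        # pip install pyobjc
        import Vision
        from Foundation import NSURL

        def recognize_text(path):
            request = Vision.VNRecognizeTextRequest.alloc().init()
            request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
            handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
                NSURL.fileURLWithPath_(path), None
            )
            success, error = handler.performRequests_error_([request], None)
            if not success:
                raise RuntimeError(error)
            # Each observation's top candidate is the most likely transcription.
            return [obs.topCandidates_(1)[0].string() for obs in request.results()]

        print("\n".join(recognize_text("scan.png")))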

    • I also wrote a Swift CLI that wraps over the Vision framework: https://github.com/nexuist/seev

      Text extraction is supported (including the ability to specify custom words not found in the dictionary), but there are also utilities for face detection, classification, etc.

Has anyone tried Kosmos [0]? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.

[0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5

  • Okay, I got Kosmos-2.5 running - here's a mini review:

    It's _extremely_ slow, about 30 seconds/page on an A10G. Maybe there's room to improve that, I don't know, but for now that's a problem.

    The actual character recognition is superb, probably comparable to the big cloud offerings. On a sample of pages of moderately challenging typed text it was literally flawless aside from non-ASCII characters.

    It can do _neat_ handwriting with reasonable accuracy. This surprised me since there doesn't seem to be any handwriting in the training data (for Pix2Struct either). However, it will sometimes just skip handwriting.

    The structured (markdown) output is sometimes impressive, occasionally a mess. I noticed two weaknesses in particular: it often starts a table as an HTML table and then switches to markdown, and it struggles to distinguish multi-column layouts unless they're pure book-like paragraphs or a table with clear gridlines. This is probably a result of sane/straightforward layouts from READMEs and scientific papers being most represented in the training data (the industry I'm in produces lots of documents with layouts that are wild even to a human).

    One other thing: as a generative model it can and will go off the rails. One document I gave it with a lot of messy handwriting produced the typed header and then just 1500 lines of greater-than symbols. To be fair I couldn't read it either. While I didn't see it produce any valid-looking but "hallucinated" output, that's a possibility too.
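
    For reference, invoking it through Hugging Face Transformers looks roughly like this (adapted from the model card, so treat it as a sketch; it needs a recent transformers release, and the prompt token selects the task):

        # pip install transformers torch pillow
        import torch
        from PIL import Image
        from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

        repo = "microsoft/kosmos-2.5"
        model = Kosmos2_5ForConditionalGeneration.from_pretrained(
            repo, device_map="cuda", torch_dtype=torch.bfloat16
        )
        processor = AutoProcessor.from_pretrained(repo)

        image = Image.open("page.png")
        prompt = "<md>"  # or "<ocr>" for plain text plus bounding boxes
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        inputs.pop("height"), inputs.pop("width")  # only needed to rescale boxes
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(torch.bfloat16)

        generated = model.generate(**inputs, max_new_tokens=1024)
        print(processor.batch_decode(generated, skip_special_tokens=True)[0])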

  • It works really well for captioning. The few attempts I made at OCR failed miserably on CCTV images (camera label at top and datetime stamp on bottom).

Fully agree.

Improving OCR would require innovation within CV, separate from transformer architectures, and frankly I don't expect much new work to happen here.