OCR4all

8 days ago (ocr4all.org)

The big complicated segmentation pipeline is a legacy from the time, a few years ago, when you had to do it that way. It's error-prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.

We need to do end to end text recognition. Not "character recognition", it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
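
For anyone unfamiliar with the metric, CER is just edit distance divided by the length of the reference text, which is exactly why it can mislead: it scores a meaning-destroying error the same as a cosmetic one. A quick toy sketch (mine, not from any OCR project):

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def cer(reference: str, hypothesis: str) -> float:
        return levenshtein(reference, hypothesis) / max(len(reference), 1)

    # One edit each, identical CER, very different consequences:
    print(cer("paid 100 florins", "paid 700 florins"))  # changes the historical record
    print(cer("paid 100 florins", "paid 100 fl0rins"))  # cosmetic, meaning intact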

  • > We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.

    Arbitrary nonsensical text requires character recognition. Sure, even a license plate bears some semantics bounding expectations of what text it contains, but text that has no coherence might remain an application domain for character rather than text recognition.

    • > Arbitrary nonsensical text requires character recognition.

      Are you sure? I mean, if it's printed text in a non-connected script, where characters repeat themselves (nearly) identically, then ok, but if you're looking at handwriting - couldn't one argue that it's _words_ that get recognized? And that's ignoring the question of textual context, i.e. recognizing based on what you know the rest of the sentence to be.

      4 replies →

  • VLMs seem to render traditional OCR systems obsolete. I'm hearing lately that Gemini does a really good job on tasks involving OCR. https://news.ycombinator.com/item?id=42952605

    Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.

    • The problem with doing OCR with LLMs is hallucination. It creates character replacements like Xerox's old flawed compression algorithm. At least that's my experience with Gemini 2.0 Flash. It was a screenshot of a webpage, too.

      Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.

      I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.

      20 replies →

    • Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.

      On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.

      I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.
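
      The bounding-box point matters more than it sounds: Tesseract gives you word-level coordinates for free, which is what lets you highlight matches, redact regions, or rebuild a searchable PDF layer. A minimal sketch with pytesseract (assumes pip install pytesseract pillow plus a system tesseract binary):

        from PIL import Image
        import pytesseract
        from pytesseract import Output

        # Word-level text plus coordinates in one call.
        data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)
        for text, left, top, width, height, conf in zip(
                data["text"], data["left"], data["top"],
                data["width"], data["height"], data["conf"]):
            if text.strip():
                print(f"{text!r} at ({left}, {top}) size {width}x{height}, conf {conf}")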

    • I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.

      For example, the slowest part of my pipeline is picture description, since I need an LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes more like 30. I might be able to send out only the sections I don't have the hardware to process.

      It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to look at e.g. facial recognition groupings like any other document section.

      [0] https://github.com/jnday/ocr_lol

    • Yes, I agree general-purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but no matter how I've prompted it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I'm trying to decipher for my genealogy research.

    • I just used Gemini for OCR a couple of hours ago because all the OCR apps I tried on Android failed at the task, lol. Wild seeing this comment right after waking up.

    • I've seen Gemini Flash 2 mention "in the OCR text" when responding to VQA tasks, which makes me wonder whether they have a traditional OCR process mixed into the pipeline.

    • > Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.

      Maybe this is what the age of desktop AGI looks like.

    • Wouldn’t an AI make assumptions and fix mistakes?

      For example instead of

      > The speiling standards were awful

      It would produce

      > The spelling standards were awful

  • The issue with that is that some writing is not word-based. People use acronyms and jargon: temporal, personalized, industry-specific, and global. At the beginning of the year there were some HN posts about moving from dictionary-word to character encoding for LLMs, because of how much writing varies.

    Even I have used symbols with different meanings as a shorthand when constructing an idea.

    I see it the same way laws work: their word definitions are anchored in time to the common dictionaries of the era. Grammar, spelling, and meanings all change over time. LLMs would require time-scoped information to properly parse content from 1400 vs. 1900. An LLM would be for trying to take meaning out of the content, as opposed to retaining the work itself.

    Character-based OCR ignores the rules, spelling, and meaning of words and gives you what is most likely there. This retains any spelling and grammar errors, whether true positives or false positives by the rules of their day.

  • Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:

    > The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.

    > The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.

    I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?

    I will say, if nothing else, I can understand certain physical considerations. For example:

    A person who is right-handed and is writing at the right edge of a page may start to slant because of the physical issue of the paper riding high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters because their hand naturally presses against fresh ink, or alternatively very "light" strokes because they are hovering their hand over the paper while the ink dries.

    In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.

    • The lower case "e" in gothic cursive often looks like a lower case "r". If you see one of these: ſ maybe you think "ah, I know that one, that's an S!" and yes, it is, but some scribes, when writing a capital H, make something that looks a LOT like it. You need context to disambiguate. Think of it as a cryptogram: if you see a certain squiggle in a context where it's clearly an "r", you can assume that the other squiggles that look like that are "r"s too. Familiarity with a scribe's hand is often necessary to disambiguate squiggles, especially in words such as proper names, where linguistic context doesn't help you a lot. And it's often the proper names which are the most interesting part of a document.

      But yes, writers can change style too. Mercifully, just like we sometimes use all caps for surnames, so some writers would use antiqua-style handwriting (i.e. what we use today) for proper names in a document which is otherwise all gothic-style handwriting. But this certainly doesn't happen consistently enough that you can rely on it, and some writers have such messy handwriting that even then, you need context to know what they're doing.

  • The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.

    It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society to automatically bootstrap intelligence.

    Surprise: garbage CEOs in, garbage intelligence out.

Looks like a great project, and I don't want to nitpick, but...

https://www.ocr4all.org/about/ocr4all > Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.

https://www.ocr4all.org/guide/setup-guide/quickstart > Quickstart > Open a terminal of your choice and enter the following command if you're running Linux (followed by a 6 line docker command).

How is that addressing the needs of non-technical users?

  • Any end-user application that uses Docker is not an end-user application. It does not matter whether the end user knows how to use Docker or not. End-user applications should be delivered as SaaS/web UIs or local binaries (GUI or CLI). Period.

  • Application installation isn't a user-level task. An application being ready for a user to use and being easy to install are separate things. You get your IT-literate helper to install it for you; then, if the program is easy for users to use, you're golden.

    • That's a very corporate mentality. Outside of an organizational context, installing applications certainly is a normal user level task. And for those users that have somebody help them, that somebody is usually just a younger person who's comfortable clicking 'Next' to get through an installer but certainly has no devops experience.

      1 reply →

  • s/non-technical users/technical users who are into docker and don't mind filling their computers with large files for no good reason/

A little secret: Apple’s Vision Framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of, including PDFs.

I wrote a simple CLI tool and more featured Python wrapper for it: https://github.com/fny/swiftocr
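
If you're curious what the wrapper boils down to, this is roughly the shape of it, sketched via pyobjc (my reconstruction for illustration, not the repo's actual code; assumes macOS with pip install pyobjc-framework-Vision):

    from Foundation import NSURL
    import Vision

    def vision_ocr(path):
        # Build an accurate-mode text recognition request.
        request = Vision.VNRecognizeTextRequest.alloc().init()
        request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
        # Run it over the image on disk.
        handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
            NSURL.fileURLWithPath_(path), None)
        ok, error = handler.performRequests_error_([request], None)
        if not ok:
            raise RuntimeError(error)
        # Each observation is a line of text; take its best candidate.
        return [obs.topCandidates_(1)[0].string() for obs in request.results()]

    print("\n".join(vision_ocr("scan.png")))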

  • This has been one of my favorite features Apple has added. When I’m in a call and someone shares a page I need the link to, rather than interrupt the speaker and ask them to share it, it’s often faster to screengrab the URL and let Apple OCR the address and take me to the page/post it in chat.

  • After getting an iPhone, being really impressed with the system-provided features, and then exploring some of the API documentation, I'm blown away by the stuff that's available. My app experience on iOS vs Android is night and day. The vision features alone have been insane, but their text recognition is just fantastic. Any image and even my god-awful handwriting gets picked up without issue.

    That said, I do love me a free and open source option for this kind of thing. I can't use it much since I'm not using Apple products for my desktop computing. Good on Apple though - they're providing some serious software value.

    • I can't comment on what Apple is doing here, but Google has an equivalent called "lens" which works really well and I use it in the way you suggest here.

  • How does it work with tables and diagrams? I have scanned pages with mixed media, where some parts are diagrams; I want to be able to extract the text but also be told where the diagrams are in the image, with coordinates.

  • I wonder if it's possible to reverse engineer that, rip it out, and put it on Linux. Would love to have that feature without having to use Apple hardware

> How is this different from tesseract and friends?

The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.

  • I didn't have good results with Tesseract, so I hope this is really different ;)

    I was surprised that even scraped screen text did not work 100% flawlessly in Tesseract. Maybe it was not made for that, but still, I had a lot of problems with high-resolution photos too. I did not try scanned documents, though.

    • I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I'm sure it's no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn't unnecessarily large (i.e. that 1200 dpi scan is overkill), you can have a pretty nice pipeline with good results.

      In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.

      I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.
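
      The gist of that kind of gate, sketched fresh rather than dug out of the old code (assumes tesseract, ocrad, and aspell on PATH, and a pre-converted resume.pgm for OCRad, which only reads PBM/PGM/PPM):

        import subprocess

        def misspelled_ratio(text: str) -> float:
            # `aspell list` echoes back only the words it considers misspelled.
            words = [w for w in text.split() if w.isalpha()]
            if not words:
                return 1.0
            bad = subprocess.run(["aspell", "list"], input=text,
                                 capture_output=True, text=True).stdout.split()
            return len(bad) / len(words)

        # First pass: Tesseract straight to stdout.
        text = subprocess.run(["tesseract", "resume.png", "stdout"],
                              capture_output=True, text=True).stdout

        # If too many words fail spellcheck, retry with OCRad.
        if misspelled_ratio(text) > 0.25:  # threshold here is arbitrary
            text = subprocess.run(["ocrad", "resume.pgm"],
                                  capture_output=True, text=True).stdout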

    • I have a typewritten manuscript that is interspersed with handwritten edits. Tesseract worked fine until the handwritten parts, then garbage. Is there a local solution anyone can recommend? I have a 16 GB Lenovo laptop and access to a workstation with an RTX 4070 Ti 16 GB card. Thanks.

  • Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter/Fraktur typesetting? I'm trying to make documents look Fraktur-like in LaTeX and would like any and all documents I can lay my hands on.

If you are interested, I also made an AI-assisted OCR API - https://github.com/kdeps/examples

It combines Tesseract (for images) and Poppler-utils (for PDFs). A local open-source LLM extracts document segments intelligently.

It can also be extended to use one or multiple Vision LLM models easily.

And finally, it packages the entire AI agent API into a Dockerized container.
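
If you just want the Tesseract + Poppler split without the agent framework, the core of it is only a few lines. A sketch (the text-layer heuristic and filenames are placeholders of mine, not part of the project):

    import glob
    import subprocess

    # Try the embedded text layer first.
    text = subprocess.run(["pdftotext", "input.pdf", "-"],
                          capture_output=True, text=True).stdout

    # If there's essentially nothing there, treat it as a scan:
    # rasterize with pdftoppm, then OCR each page with Tesseract.
    if len(text.strip()) < 50:
        subprocess.run(["pdftoppm", "-png", "-r", "300", "input.pdf", "page"], check=True)
        text = "".join(
            subprocess.run(["tesseract", p, "stdout"],
                           capture_output=True, text=True).stdout
            for p in sorted(glob.glob("page-*.png")))

    print(text)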

> Designed with usability in mind

Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces.

[...] https://www.ocr4all.org/guide/setup-guide/windows

------------------

I'm sorry. I suppose this is great, but an .exe file is designed for usability. A Docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are just extra steps compared to other OCR programs that work out of the box.

I think the current sweet-spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open source project which has been shared before as a Show HN:

https://github.com/Dicklesworthstone/llm_aided_ocr

This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
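
The core loop is short; a condensed sketch of the approach (not the repo's actual code — correct_with_llm is a placeholder for whichever chat-completion client you use):

    from PIL import Image
    import pytesseract

    PROMPT = ("Correct the OCR errors in the following text. Fix obvious "
              "misrecognitions, restore paragraph breaks and formatting, and "
              "change nothing that is already correct:\n\n")

    def correct_with_llm(prompt: str) -> str:
        # Placeholder: call your preferred LLM API or local model here.
        raise NotImplementedError

    raw = pytesseract.image_to_string(Image.open("scan.png"))
    cleaned = correct_with_llm(PROMPT + raw)
    print(cleaned)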

  • What kind of accuracy have you reached with this Tesseract+LLM pipeline? I imagine there would be a hard limit on how much the LLM could improve the OCR-extracted text from Tesseract, since that is far from perfect itself.

    I haven't seen many people mention it, but I've just been using the PaddleOCR library on its own and it has been very good for me, often achieving better quality/accuracy than some of the best VLMs, and generally much better quality than other open-source OCR models I've tried, like Tesseract (minimal usage sketch after the links below).

    That being said, my use case is definitely focused primarily on digital text, so if you're working with handwritten text, take this with a grain of salt.

    https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_e...

    https://huggingface.co/spaces/echo840/ocrbench-leaderboard
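
    The usage sketch mentioned above (PaddleOCR 2.x API, English models; it downloads weights on first run):

      from paddleocr import PaddleOCR

      ocr = PaddleOCR(use_angle_cls=True, lang="en")
      result = ocr.ocr("screenshot.png", cls=True)
      for box, (text, confidence) in result[0]:
          print(text, confidence)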

  • Have you used your project on classical languages like Latin / Ancient Greek / Hebrew etc? Will the LLM fall flat in those cases, or be able to help?

    • I haven’t, but I bet it would work pretty well, particularly if you tweaked the prompts to explain that it’s dealing with Ancient Greek or whatever and gave a couple of examples of how to handle things.

What is this? A new SOTA OCR engine (which would be very interesting to me) or just a tool that uses other known engines (which would be much less interesting to me).

A movement? A socio-political statement?

If only landing pages could be clearer about wtf it actually is ...

"OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material."

It seems to be based on OCR-D, which itself is based on

- https://github.com/tesseract-ocr/tesseract

- https://kraken.re/main/index.html

- https://github.com/ocropus-archive/DUP-ocropy

- https://github.com/Calamari-OCR/calamari

See

- https://ocr-d.de/en/models

It seems to be an open-source alternative to https://www.transkribus.org/ ( which uses amongst others https://atr.pages.teklia.com/pylaia/pylaia/ )

Another alternative is https://escriptorium.inria.fr/ ( which uses kraken)

OCR is all well and good; I thought it was mostly solved with Tesseract, so what does this bring? But what I’m looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings available, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image to compress separately, and then binding them back together into a compatible PDF.

  • Cheap network-locked iPhone SE2s on eBay seem to be a cost-effective option with good accuracy: https://findthatmeme.com/blog/2023/01/08/image-stacks-and-ip...

    • Very interesting article. I'd be interested to know if an M-series Mac Mini (this article was early 2023, so there should've been M1 and M2) would have also filled this role just fine.

      > My preliminary speed tests were fairly slow on my MacBook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU).

      I don't know a lot about the specifics of where (hardware-wise) this gets run, but I'd assume any semi-modern Mac would also have accelerated compute for this kind of thing. Running it on a Mac Mini would ease my worries about battery and heat issues. I would've guessed that they'd scale better as well, but I have no idea if that's actually the case. Also, you'd be able to run the server as a service for automatic restarts and such.

      All that said, a rack of iPhones is pretty fun.

  • > OCR is all well and good; I thought it was mostly solved with Tesseract, so what does this bring?

    Tesseract is nice, but not good enough that there is no opportunity for another, better solution.

  • > OCR is all well and good; I thought it was mostly solved with Tesseract, so what does this bring?

    This is specifically for historic documents that tesseract will handle poorly. It also provides a good interface for retraining models on a specific document set, which will help for documents that are different from the training set.

  • Run Tesseract on a screenshot and you'll be underwhelmed.

    • With proper image pre-processing, Tesseract can recognize even tiny text (5-7 px high).
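
      By pre-processing I mean something like this sketch: upscale aggressively, binarize, then hand Tesseract a uniform block of text (assumes pip install opencv-python pytesseract):

        import cv2
        import pytesseract

        img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
        # 4x cubic upscale turns 5-7 px glyphs into ~20-28 px ones.
        img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
        # Otsu thresholding for clean black-on-white text.
        img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        # --psm 6: assume a single uniform block of text.
        print(pytesseract.image_to_string(img, config="--psm 6"))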

As this project is geared toward "early modern prints", any recommendations for the best OCR/LLM solution for poor-quality typed manuscripts?

Wow. Setup took 12 GB of my disk. First impression: nice UI, but no idea what to do with it or how to create a project. It tells me "session expired" no matter what I try to do. Definitely not a batteries-included kind of thing; I will need to explore later.

I've been looking for a project that would have an easy free/extremely cheap way to do OCR/image recognition for generating ALT text automatically for social media. Some sort of embedded implementation that looks at an image and is either able to transcribe the text, or (preferably) transcribe the text AND do some brief image recognition.

I generally do this manually with Claude and it's able to do it lightning fast, but a small dev making a third party Bluesky/Mastodon/etc client doesn't have the resources to pay for an AI API.
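
One cheap, fully local combination would be a small captioning model for the description plus Tesseract for any embedded text. A sketch (the model choice and glue are my assumptions, not a finished client feature):

    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor
    import pytesseract

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("post.jpg").convert("RGB")

    # Short natural-language description of the image.
    inputs = processor(image, return_tensors="pt")
    caption = processor.decode(model.generate(**inputs, max_new_tokens=30)[0],
                               skip_special_tokens=True)

    # Any text baked into the image.
    embedded = pytesseract.image_to_string(image).strip()

    alt = f"{caption}. Text in image: {embedded}" if embedded else caption
    print(alt)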

  • Such an approach moves the cost of accessibility to each user individually. It is not bad as a fallback mechanism, but I hope that those who publish won't decide that AI absolves them of the need to post accessible content. After all, if they generate the alt text on their side, they can do it only once and it is accessible to everyone, while saving multiple executions of the same recognition task on the other end. Additionally, they have more control over how the image is interpreted, and I hope that this really would matter.

What differentiates this from other tools? Eg tesseract, EasyOcr?

They lost me when they suggested I install docker.

Now, I wouldn't mind if they suggested that as an _option_ for people whose system might exhibit compatibility problems, but - come on! How lazy can you get? You can't be bothered to cater to anything other than your own development environment, which you want us to reproduce? Then maybe call yourself "OCR4me", not "OCR4all".

  • > How lazy can you get?

    Genuine question: What would the ideal docker-free solution look like in your opinion? That is, something that is accessible to the average university student, researcher, and faculty member? What installation tool/package manager would this hypothetical common user use? How many hoops would they have to jump through? The hub page lists the various dependencies [0], which on their own are pretty complicated packages.

    [0] https://hub.docker.com/r/uniwuezpd/ocr4all

I don't wish to speak out of turn, but it looks like this project hasn't been active for about a year. I checked GitHub and the last update was in Feb 2024. Their last post to X was 25 Oct 2023. :(

It's cool but is there any doubt that this will be very obsolete very soon? This is like how image recognition worked pre-CNN.

(It looks like the project started in 2022. So maybe it wasn't obvious at the time)

This looks promising; not sure how it stacks up against Transkribus, which seems to be the leader in the space since it supports handwriting and trainable ML for your own dataset.

Training a model exclusively on written material from specific time periods would be a fascinating way to explore history, especially in schools.

I've been using Tesseract for a few years on a personal project, and I'd be interested to know how they compare in terms of system resources, given that I am running it on a Dell OptiPlex Micro with 8 GB of RAM and a 6th-gen i5. Tesseract is barely noticeable, so this is just curiosity at this point; I don't have any reason to even consider switching over. I do, however, have a large dataset of several hundred GB of scanned PDFs which would be worth digitizing when I find some time to spare.