Comment by vintermann
8 days ago
The big complicated segmentation pipeline is a legacy from the time you had to do that, a few years ago. It's error-prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.
We need to do end to end text recognition. Not "character recognition", it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
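(For anyone unfamiliar: CER is just a character-level edit-distance ratio, roughly the sketch below, which is exactly why it misses what actually matters.)

```python
# Minimal sketch of Character Error Rate: Levenshtein distance between
# hypothesis and reference, divided by reference length. Plain Python,
# no metric library.
def cer(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(
                prev[j] + 1,             # edit operation against reference
                cur[j - 1] + 1,          # edit operation against hypothesis
                prev[j - 1] + (r != h),  # substitution (or exact match)
            ))
        prev = cur
    return prev[-1] / max(len(reference), 1)

print(cer("Charles III", "Charles ||I"))  # ~0.18: modest CER, real semantic damage
```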
> We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.
Arbitrary nonsensical text requires character recognition. Sure, even a license plate carries some semantics that bound expectations of what text it contains, but text that has no coherence might remain an application domain for character rather than text recognition.
> Arbitrary nonsensical text requires character recognition.
Are you sure? I mean, if it's printed text in a non-connected script, where characters repeat themselves (nearly) identically, then ok, but if you're looking at handwriting - couldn't one argue that it's _words_ that get recognized? And that's ignoring the question of textual context, i.e. recognizing based on what you know the rest of the sentence to be.
Handwriting with words is not arbitrary nonsensical text
Not really. I have an HTR use case where the data is highly specialized codes. All the OCR software I use gets tripped up trying to force the content into the category of English words.
LLMs can help, but I’ve also had issues where the repetitive nature of the content can reliably result in terrible hallucinations.
VLMs seem to render traditional OCR systems obsolete. I'm hearing lately that Gemini does a really good job on tasks involving OCR. https://news.ycombinator.com/item?id=42952605
Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
The problem with doing OCR with LLMs is hallucination. It creates character replacements like Xerox's old flawed compression algorithm. At least that's my experience with Gemini 2.0 Flash, and it was a screenshot of a webpage, too.
Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.
I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.
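If it does come to fine-tuning Tesseract, using the resulting model from Python looks roughly like this. A rough sketch, assuming pytesseract and Pillow are installed; "myhand" and the ./tessdata directory are hypothetical names for the custom traineddata file and its location:

```python
# Sketch: running Tesseract with a custom fine-tuned model via pytesseract.
# Assumes a hypothetical myhand.traineddata sits in ./tessdata.
import pytesseract
from PIL import Image

image = Image.open("note.png")
text = pytesseract.image_to_string(
    image,
    lang="myhand",                       # name of the custom model
    config="--tessdata-dir ./tessdata",  # where the .traineddata lives
)
print(text)
```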
Paradoxically, LLMs should be the tool to fix traditional OCR: recognizing that "Charles ||I" should be "Charles III", that "carrot ina box" should be "carrot in a box", that the century of the event implied by context cannot be the one suggested by a literal reading of the glyphs, etc.
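In other words, a post-correction pass where the raw OCR output goes through an LLM. A rough sketch, with a hypothetical call_llm standing in for whatever model/API you actually use:

```python
# Sketch of LLM post-correction of raw OCR output. `call_llm` is a
# placeholder for whatever chat/completion backend you have.
def build_correction_prompt(ocr_text: str) -> str:
    return (
        "The following text came from OCR and may contain character-level "
        "errors (e.g. 'Charles ||I' for 'Charles III', 'carrot ina box' for "
        "'carrot in a box'). Correct obvious OCR mistakes only; do not "
        "rephrase, and keep the original spelling conventions of the era.\n\n"
        + ocr_text
    )

def correct_ocr(ocr_text: str, call_llm) -> str:
    # The LLM only sees the transcript plus instructions, so it fixes
    # glyph-level mistakes without being free to rewrite the content.
    return call_llm(build_correction_prompt(ocr_text))
```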
Gemini does not seem to do OCR with the LLM itself. They seem to use their existing OCR technology and feed its output into the LLM. If you set the temperature to 0 and ask for the exact text as found in the document, you get really good results. I once got weird results where I got literally the JSON output of the OCR result with bounding boxes and everything.
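For reference, the "temperature 0, ask for the exact text" setup looks roughly like this with the google-generativeai Python SDK; the model name and exact parameter shapes are assumptions, so check the current docs:

```python
# Rough sketch: asking Gemini for a verbatim transcription at temperature 0.
# Assumes the google-generativeai SDK; model name and exact parameters
# may differ from the current API.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

page = Image.open("scan.png")
response = model.generate_content(
    ["Return the exact text found in this document, with no corrections "
     "or additions.", page],
    generation_config={"temperature": 0},
)
print(response.text)
```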
I've been looking for an algorithm for running OCR or STT results through multiple language models, comparing the results, and detecting hallucinations as well as correcting errors by combining the results in a kind of group-consensus way. I figured someone must have done something similar already. If anyone has any leads or more thoughts on algorithm implementation, I'd appreciate it.
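One crude version of this is word-level alignment plus majority voting across the model outputs, ROVER-style as in speech recognition. A minimal sketch under that assumption (it drops insertions relative to the pivot and is only meant to show the idea):

```python
# Sketch of a consensus over several model transcripts: align each
# transcript to a pivot with difflib, then majority-vote per word slot.
# Slots with weak agreement are flagged as possible hallucinations.
from collections import Counter
from difflib import SequenceMatcher

def consensus(transcripts: list[str], min_agreement: float = 0.5):
    tokenized = [t.split() for t in transcripts]
    pivot = max(tokenized, key=len)          # crude choice of pivot
    votes = [Counter() for _ in pivot]

    for words in tokenized:
        sm = SequenceMatcher(a=pivot, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag in ("equal", "replace"):
                for k, i in enumerate(range(i1, i2)):
                    j = j1 + k
                    if j < j2:
                        votes[i][words[j]] += 1

    merged, flagged = [], []
    for i, counter in enumerate(votes):
        if not counter:
            continue
        word, count = counter.most_common(1)[0]
        merged.append(word)
        if count / len(tokenized) < min_agreement:
            flagged.append((i, dict(counter)))  # low-agreement slot
    return " ".join(merged), flagged
```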
Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.
On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.
I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.
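On the bounding-box point: this is the kind of positional output a traditional engine gives you out of the box, here via pytesseract (a minimal sketch):

```python
# Sketch: word-level bounding boxes from Tesseract via pytesseract,
# the positional output the article says Gemini 2.0 can't provide.
import pytesseract
from PIL import Image

image = Image.open("page.png")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for word, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"]):
    if word.strip():
        print(f"{word!r} at ({left}, {top}, {width}x{height}), conf={conf}")
```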
I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.
For example, the slowest part of my pipeline is picture description, since I need an LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes more like 30. I might be able to send out only the sections I don't have the hardware to process.
It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to treat e.g. facial-recognition groupings like any other document section.
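The routing idea amounts to something like the sketch below; both describe functions are placeholders for whatever local/remote backends you run, and the pixel-count threshold is just an assumed stand-in for "too big for my hardware":

```python
# Sketch of routing document sections between a cheap local model and a
# larger remote one. `local_describe` and `remote_describe` are
# placeholders for the actual backends.
def describe_sections(sections, local_describe, remote_describe,
                      max_local_pixels=500_000):
    results = []
    for section in sections:
        w, h = section["image"].size
        if w * h <= max_local_pixels:
            # small/simple picture: the tiny local model is good enough
            results.append(local_describe(section["image"]))
        else:
            # too heavy for low-end hardware: farm it out
            results.append(remote_describe(section["image"]))
    return results
```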
[0] https://github.com/jnday/ocr_lol
For self-hosting, check out Qwen-VL: https://github.com/QwenLM/Qwen-VL
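Getting it running via transformers looks roughly like this, going from memory of the repo's README, so treat the exact helper names (from_list_format, model.chat) as assumptions and check the README:

```python
# Rough sketch of self-hosting Qwen-VL-Chat via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True).eval()

# The custom code shipped with the model adds helpers for mixed
# image/text prompts and multi-turn chat.
query = tokenizer.from_list_format([
    {"image": "scan.png"},
    {"text": "Transcribe the text in this image exactly as written."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```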
I just used Gemini for OCR a couple of hours ago because all the OCR apps I tried on Android failed at the task, lol. Wild seeing this comment right after waking up.
Yes, I agree general purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but no matter how I prompt it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I try to decipher for my genealogy research.
I've seen Gemini Flash 2 mention "in the OCR text" when responding to VQA tasks, which makes me question whether they have a traditional OCR process mixed into the pipeline.
> Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
Maybe this is what the age of desktop AGI looks like.
Wouldn’t an AI make assumptions and fix mistakes?
For example instead of
> The speiling standards were awful
It would produce
> The spelling standards were awful
The issue with that is that some writing is not word-based. People use acronyms and jargon: temporal, personalized, industry-specific, and global. At the beginning of the year, there were some HN posts about moving from dictionary-word encoding to character encoding for LLMs, because of how much writing varies.
Even I have used symbols with different meanings as a shorthand when sketching out an idea.
I see it the same way laws are treated: their word definitions are anchored in time by the common dictionaries of the era. Grammar, spelling, and meanings all change over time. LLMs would require time-scoped information to properly parse content from 1400 vs. 1900. An LLM would be for trying to extract meaning from the content, versus retaining the work as written.
Character-based OCR ignores the rules, spelling, and meaning of words and provides what is most likely there. This retains any spelling and grammar errors, whether true positives or false positives by the rules of their day.
Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:
> The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.
> The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.
I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?
I will say, if nothing else, I can understand certain physical considerations. For example:
A person who is right-handed, and is writing on the right edge of a page, may start to slant because of the physical issue of the paper being high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters because their hand is naturally going to press against fresh ink, or alternatively have very "light" strokes because they are hovering their hand over the paper while the ink dries.
In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.
The lower case "e" in gothic cursive often looks like a lower case "r". If you see one of these: ſ maybe you think "ah, I know that one, that's an S!" and yes, it is, but some scribes when writing a capital H makes something that looks a LOT like it. You need context to disambiguate. Think of it as a cryptogram: if you see a certain squiggle in a context where it's clearly an "r", you can assume that the other squiggles that look like that are "r"s too. Familiarity with a scribe's hand is often necessary to disambiguate squiggles, especially in words such as proper names, where linguistic context doesn't help you a lot. And it's often the proper names which are the most interesting part of a document.
But yes, writers can change style too. Mercifully, just like we sometimes use all caps for surnames, some writers would use antiqua-style handwriting (i.e. what we use today) for proper names in a document which is otherwise all gothic-style handwriting. But this certainly doesn't happen consistently enough that you can rely on it, and some writers have such messy handwriting that even then, you need context to know what they're doing.
The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.
It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society to automatically bootstrap intelligence.
Surprise: garbage CEOs in, garbage intelligence out.