Comment by modeless
8 days ago
VLMs seem to render traditional OCR systems obsolete. I'm hearing lately that Gemini does a really good job on tasks involving OCR. https://news.ycombinator.com/item?id=42952605
Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
The problem with doing OCR with LLMs is hallucination. They make silent character substitutions, much like Xerox's old flawed compression algorithm did. At least, this is my experience with Gemini 2.0 Flash, and that was on a screenshot of a webpage, too.
Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.
I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.
Paradoxically, LLMs should be the tool to fix traditional OCR: recognizing that "Charles ||I" should be "Charles III", that "carrot ina box" should be "carrot in a box", that the century of the event implied by the context can't be the one suggested by the glyphs alone, etc.
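A minimal sketch of that idea, with a hypothetical call_llm() standing in for whatever model/client you'd actually use; the prompt carries all the logic:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire up whichever LLM client you actually use.
    raise NotImplementedError

def post_correct(ocr_text: str) -> str:
    # The prompt does the work: fix only mechanical OCR errors, never rephrase.
    prompt = (
        "Fix obvious OCR errors in the text below, e.g. 'Charles ||I' -> "
        "'Charles III' and 'carrot ina box' -> 'carrot in a box'. "
        "Do not rephrase, reorder, or add anything.\n\n" + ocr_text
    )
    return call_llm(prompt)
```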
As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents, this will not work with modern OCR. Modern OCR is too good.
If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you add will introduce more errors than it fixes. LLMs are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results.
I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1]
OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly, and when it does fail, it fails in ways not easily corrected by LLMs. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations, which an LLM obviously can't fix.
[1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati...
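If anyone wants to try that engine, here's a minimal sketch using pytesseract; it assumes Tesseract 4+ with an English traineddata file installed, and scan.png is a placeholder:

```python
import pytesseract
from PIL import Image

page = Image.open("scan.png")  # placeholder: your scanned page
text = pytesseract.image_to_string(
    page,
    lang="eng",                 # requires the matching traineddata file
    config="--oem 1 --psm 3",   # OEM 1 = LSTM-only engine, PSM 3 = auto page layout
)
print(text)
```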
9 replies →
Maybe you could pass the finished text through a simple, fast grammar model to clean it up. You don't need a 40B-parameter, 200 GB, A100-demanding language model to fix these mistakes; that's absurdly wasteful in every sense.
I'm sure there can be models that run on last-gen CPUs' AI accelerators and fix these kinds of mistakes faster than real time, and I'm sure Microsoft Word has already been doing this for some languages for quite some time.
Heck, even Apple has on-device models that can autocomplete words now, and even though their context window and completion size are not that big, they let me jump ahead with a simple tap of Tab.
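Along those lines, even a local LanguageTool instance already catches many of these mistakes with no LLM involved; a sketch using the language_tool_python wrapper (needs Java locally, and the example string is made up):

```python
import language_tool_python

# Downloads and runs a local LanguageTool server (requires Java).
tool = language_tool_python.LanguageTool("en-US")
ocr_output = "carrot ina box"      # made-up OCR error from upthread
matches = tool.check(ocr_output)   # inspect the individual rule hits
print(tool.correct(ocr_output))    # apply the top suggestion for each hit
tool.close()
```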
3 replies →
One might need document/context fine-tuning. It's not beyond possibility that a piece of text really is about Charles ll1 (el-el-one), someone's pet language model or something. Sometimes you want correction of "obvious" mistakes (as with predictive text); other times you really did write keming [with an m].
1 reply →
Gemini does not seem to do OCR with the LLM itself. They appear to feed the output of their existing OCR technology into the LLM. If you set the temperature to 0 and ask for the exact text as found in the document, you get really good results. I once got weird output where the reply was literally the JSON of the OCR result, bounding boxes and all.
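For reference, a sketch of that setup with the google-generativeai SDK; the model name, prompt, and file path are placeholders, and this is of course not a claim about what Google runs internally:

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")            # key from AI Studio
model = genai.GenerativeModel("gemini-2.0-flash")  # model name as of writing
image = PIL.Image.open("document.png")             # placeholder path

response = model.generate_content(
    ["Return the exact text as found in this document. Do not correct anything.", image],
    generation_config=genai.types.GenerationConfig(temperature=0),
)
print(response.text)
```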
Interesting, thanks for the information. Unfortunately, there's no way I'll send my personal notebooks to a service when I don't know what it will do with them in the long term. I might use it for publicly available information, though.
Thanks again.
I've been looking for an algorithm that runs OCR or STT results through multiple language models, compares the results to detect hallucinations, and corrects errors by combining the outputs in a kind of group-consensus way. I figured someone must have done something similar already. If anyone has any leads or further thoughts on implementation, I'd appreciate it.
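I don't know of a packaged solution, but a naive word-level majority vote over aligned outputs looks something like this (alignment against the first transcript via difflib; a serious version would use proper multiple-sequence alignment, e.g. ROVER for STT):

```python
from collections import Counter
import difflib

def consensus(transcripts):
    # Take the first transcript as the alignment reference.
    ref = transcripts[0].split()
    # votes[i] collects candidate words for reference position i.
    votes = [Counter({w: 1}) for w in ref]
    for other in transcripts[1:]:
        words = other.split()
        sm = difflib.SequenceMatcher(a=ref, b=words, autojunk=False)
        for op, a1, a2, b1, b2 in sm.get_opcodes():
            if op in ("equal", "replace"):
                for i, j in zip(range(a1, a2), range(b1, b2)):
                    votes[i][words[j]] += 1
    # Keep the majority word at each reference position.
    return " ".join(c.most_common(1)[0][0] for c in votes)

print(consensus([
    "Charles III was crowned in 1661",
    "Charles II was crowned in 1661",
    "Charles II was crowned in 1561",
]))  # -> "Charles II was crowned in 1661"
```

Known limitation: words that only appear as insertions in the non-reference transcripts never get voted in, so the choice of reference matters.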
Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.
On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.
I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.
I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.
For example, the slowest part of my pipeline is picture description, since I need an LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes more like 30. I might be able to send out only the sections I don't have the hardware to process.
It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to treat e.g. facial-recognition groupings like any other document section.
[0] https://github.com/jnday/ocr_lol
For self hosting check out Qwen-VL: https://github.com/QwenLM/Qwen-VL
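A minimal usage sketch adapted from the Qwen-VL README (the model name is real; the image path and prompt are placeholders, and trust_remote_code pulls in the repo's custom chat/tokenizer code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# from_list_format / chat come from the model's custom code, per its README.
query = tokenizer.from_list_format([
    {"image": "scan.png"},  # placeholder path
    {"text": "Transcribe all text in this image exactly as written."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```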
I just used Gemini for OCR a couple of hours ago because all the OCR apps I tried on Android failed at the task, lol. Wild seeing this comment right after waking up.
Yes, I agree general-purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but no matter how I've prompted it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I'm trying to decipher for my genealogy research.
I've seen Gemini 2.0 Flash mention "in the OCR text" when responding to VQA tasks, which makes me question whether they have a traditional OCR process mixed into the pipeline.
> Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
Maybe this is what the age of desktop AGI looks like.
Wouldn’t an AI make assumptions and fix mistakes?
For example instead of
> The speiling standards were awful
It would produce
> The spelling standards were awful