Comment by bayindirh

8 days ago

The problem with doing OCR with LLMs is hallucination. It creates character replacements, much like Xerox's old flawed compression algorithm did. At least, this is my experience with Gemini 2.0 Flash. It was a screenshot of a webpage, too.

Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.

I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.

Paradoxically, LLMs should be the tool to fix traditional OCR: recognizing that "Charles ||I" should be "Charles III", that "carrot ina box" should be "carrot in a box", that the century of the event implied by the context cannot be the one read off the glyphs, etc.
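
Something like this, as a rough sketch (assuming the openai Python client; the model name and prompt are placeholders, not a recommendation):

```python
# Hypothetical post-correction pass: feed raw OCR output to an LLM and ask
# only for conservative, context-based fixes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def correct_ocr(raw_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Fix obvious OCR errors in the user's text "
                        "(e.g. 'Charles ||I' -> 'Charles III', 'ina' -> 'in a'). "
                        "Change nothing else; preserve wording and layout."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```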

  • As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents: this will not work with modern OCR. Modern OCR is too good.

    If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you do will introduce more errors than it fixes. LLMs are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results.

    I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1]

    OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly. When it does fail, it fails in ways not easily corrected by LLMs. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations which an LLM obviously can't fix.

    [1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati...
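
    A minimal sketch of driving that LSTM engine from Python, assuming the pytesseract wrapper (path and language are placeholders):

    ```python
    # Run Tesseract's LSTM-only engine (--oem 1) on a page image.
    import pytesseract
    from PIL import Image

    page = Image.open("scan_0001.png")   # placeholder path
    text = pytesseract.image_to_string(
        page,
        lang="eng",
        config="--oem 1 --psm 3",        # OEM 1 = LSTM engine only
    )
    print(text)
    ```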

    • > As someone ... Modern OCR is too good

      I too have extensive, and quite recent, experience: I get a significant number of avoidable errors.

      > at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_

      You are thinking of a fully automated process, not of human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reason to believe that a properly calibrated LLM-based system can act on a large number of true positives while producing a negligible number of false positives.
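
      A minimal sketch of that review loop, using only the standard library (file names are placeholders):

      ```python
      # Show the reviewer only the lines the LLM changed, as in
      # `diff ocr_output llm_corrected`.
      import difflib
      import sys

      ocr_lines = open("ocr_output.txt").readlines()
      llm_lines = open("llm_corrected.txt").readlines()

      sys.stdout.writelines(
          difflib.unified_diff(ocr_lines, llm_lines,
                               fromfile="ocr_output", tofile="llm_corrected"))
      ```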

      > LSTM

      I am using LSTM-based engines, and it is on their output that I said «I get a significant number of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still a 4.x release), and I have recently noticed (simply through `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within the 4.x series, from the early to the later releases.

      > numbers and abbreviations which an LLM obviously can't fix

      ? It's the opposite: for the numbers it could go (implicitly) "are you sure? I have a different figure for that", and for abbreviations, the LLM is exactly the thing that should guess them out of the context. The LLM is the thing that knows that the one defeated by Cromwell should really be "Charles II" plus a stain to be removed, rather than an apparent "Charles III".

    • > OCR is already at the point where adding an LLM at the end is counterproductive

      That's mass OCR on printed documents. On handwritten documents, LLMs help. There are tons of documents that even top human experts can't read without context and domain language. Printed documents are intended to be readable character by character; often the only thing the author of a handwritten note intends is to remind himself of what he was thinking when he wrote it.

      Also, what are the downstream tasks? What do you need character-level accuracy for? In my experience, it's often for indexing and search. I believe LLMs have a higher ceiling there, and can in principle (if not yet in practice) find data and answer questions about a text better than straightforward indexing or search can. I can't count the number of times I've missed a child in genealogy, for example, because I didn't think of searching the (fully and usually correctly) indexed data for some spelling or naming variant.

  • Maybe you can pass the completed text through a simple, fast grammar model to improve it. You don't need a 40B/200GB, A100-demanding language model to fix these mistakes; that's absurdly wasteful in every sense.

    I'm sure there can be models that run on the AI accelerators of last-gen CPUs and fix these kinds of mistakes faster than real time, and I'm sure Microsoft Word has already been doing this for some languages for quite some time.

    Heck, even Apple has on-device models that can autocomplete words now, and even though their context window and completion size are not that big, they let me jump ahead with a simple tap of Tab.

    • I wonder if this is a case where you want an encoder-decoder model. It seems very much like a translation task, only one where training data is embarrassingly easy to synthesize by just grabbing sentences from a corpus and occasionally swapping, inserting, and deleting characters (see the sketch below).

      In terms of attention masking, it seems like you want the input to be unmasked, since the input is fixed for a given “translation”, and then for the output tokens to use causally masked self attention plus cross attention with the input.

      I wonder if you could get away with a much smaller network this way because you’re not pointlessly masking input attention for a performance benefit that doesn’t matter.
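
      A rough sketch of that synthesis step (standard library only; the corruption rates are arbitrary):

      ```python
      # Build (noisy, clean) training pairs by randomly replacing, inserting
      # and deleting characters in clean sentences.
      import random
      import string

      def corrupt(text: str, p: float = 0.03) -> str:
          out = []
          for ch in text:
              r = random.random()
              if r < p:                     # drop the character
                  continue
              if r < 2 * p:                 # replace it
                  out.append(random.choice(string.ascii_letters + " "))
              elif r < 3 * p:               # insert a spurious character
                  out.append(ch)
                  out.append(random.choice(string.ascii_letters))
              else:
                  out.append(ch)
          return "".join(out)

      corpus = ["carrot in a box", "Charles III"]   # stand-in corpus
      pairs = [(corrupt(sentence), sentence) for sentence in corpus]
      ```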

    • > Maybe you can pass the completed text through a simple, fast grammar model to improve it

      Yes - but not really a "grammar model": it should be a statistical model about text, with "transformer's attention" (the core of LLMs): something that identifies whether the fed text contains statistical anomalies (which the glitches are).

      Unfortunately, small chatbot LLMs do not follow instructions ("check the following text"); they just invent stories, and I am not aware of a specialized model that can be fed text to scan for anomalies. Some have mentioned a BERT variant, which, as I understand it, still does not have great accuracy.

      It is a relatively small problem that probably does not have a specialized solution yet. Even a simple input-output box that worked like "evaluate the statistical probability of each token" would do: we would then check for spikes of anomaly; a rough sketch follows below. (For clarity: this is not plain spellchecking, as we want to identify anomalies in context.)

      --

      Edit: a check I have just done with an engine I had not yet used for this purpose turns up a number of solutions... but none of them a good, specific tool, I am afraid.
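
      A minimal sketch of the "score every token" idea, assuming the Hugging Face transformers library with GPT-2 as a stand-in scoring model (the threshold is arbitrary):

      ```python
      # Flag tokens whose log-probability under a small causal LM is
      # unusually low: candidate OCR glitches to review in context.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      text = "The king defeated by Cromwell was Charles ||I."
      ids = tok(text, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(ids).logits

      # Log-probability of each token given the ones before it.
      logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
      scores = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
      for token, score in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), scores):
          flag = "  <-- anomaly?" if score < -10 else ""
          print(f"{token!r}\t{score.item():.2f}{flag}")
      ```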

  • One might need document/context fine-tuning. It's not beyond possibility that a piece of text really is about "Charles ll1" (el-el-one), someone's pet language model or something. Sometimes you want correction of "obvious" mistakes (as with predictive text); other times you really did write keming [with an m].

    • And that is why I wrote that you need an LLM, i.e. (next post) a «statistical model about text, with "transformer's attention"», as «[what we want] is not plain spellchecking, as we want to identify anomalies in context».

      To properly correct text you need a system that checks large blocks of text with some understanding, not just disconnected words.

      Edit: minutes ago a member (in a faraway post) wrote «sorry if this is a bit cheeky»... He meant "cheesy". You need some level of understanding to see those mistakes. Marking words that are outside the dictionary is not sufficient.

Gemini does not seem to do OCR with the LLM itself. They seem to use their existing OCR technology and feed its output into the LLM. If you set the temperature to 0 and ask for the exact text as found in the document, you get really good results. I once got weird results where I received, literally, the JSON output of the OCR step, with bounding boxes and everything.
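
For what it's worth, a minimal sketch of that kind of request, assuming the google-generativeai Python client (model name, prompt and path are just examples):

```python
# Ask Gemini for a verbatim transcription at temperature 0.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")  # your API key
model = genai.GenerativeModel("gemini-2.0-flash")

page = Image.open("scan_0001.png")
response = model.generate_content(
    [page, "Return the exact text as found in this document, nothing else."],
    generation_config=genai.GenerationConfig(temperature=0),
)
print(response.text)
```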

  • Interesting, thanks for the information. Unfortunately, there's no way I'll be sending my personal notebooks to a service when I don't know what it's going to do with them in the long term. However, I might use it for the publicly available information.

    Thanks again.

I've been looking for an algorithm for running OCR or STT results through multiple language models, comparing the results, and detecting hallucinations as well as correcting errors by combining the results in a kind of group-consensus way. I figured someone must have done something similar already. If anyone has any leads or further thoughts on implementation, I'd appreciate it.
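
In case it helps: a crude starting point is to align each candidate transcript against one of them and take a majority vote per position. A real system would use proper multiple-sequence alignment and per-token confidence, but as a sketch of the idea:

```python
# Group-consensus sketch: align candidates against a reference transcript
# with difflib and keep the majority word at each aligned position.
from collections import Counter
from difflib import SequenceMatcher

def consensus(candidates: list[str]) -> str:
    ref = candidates[0].split()
    votes = [[w] for w in ref]                  # collected votes per reference word
    for other in candidates[1:]:
        words = other.split()
        matcher = SequenceMatcher(a=ref, b=words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ("equal", "replace"):
                for k in range(min(i2 - i1, j2 - j1)):
                    votes[i1 + k].append(words[j1 + k])
    return " ".join(Counter(v).most_common(1)[0][0] for v in votes)

print(consensus([
    "Charles III was defeated",   # hallucinated variant
    "Charles II was defeated",
    "Charles II was defeated",
]))  # -> "Charles II was defeated"
```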