Comment by pietz

3 months ago

My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown, with some additional formatting instructions. That will cost you around $0.20 per 1000 pages in batch mode, and the results have been great.
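
In case it's useful, a minimal sketch of that setup (assuming the google-genai Python SDK and an API key in the environment; the exact model name may differ, and the $0.20 per 1000 pages figure is with the batch API rather than this synchronous call):

```python
# One page image in, Markdown out.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

PROMPT = (
    "Extract the full content of this page as Markdown. "
    "Preserve headings, lists and tables. Do not add commentary."
)

def page_to_markdown(png_path: str) -> str:
    image = types.Part.from_bytes(
        data=Path(png_path).read_bytes(), mime_type="image/png"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[image, PROMPT],
    )
    return response.text

print(page_to_markdown("page_001.png"))
```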

I'd be interested to hear where OCR still struggles today.

Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on those things.

Curious to hear which OCR tool or LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first writing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist

  • > Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML:

    But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.

    For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
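
    A rough sketch of that two-pass flow, using the Gemini snippet from the parent comment as a stand-in for Magistral (the prompts and file name are just illustrations):

    ```python
    # Two passes: 1) infer the table's structure, 2) extract the data against it.
    from pathlib import Path

    from google import genai
    from google.genai import types

    client = genai.Client()
    MODEL = "gemini-2.5-flash-lite"  # stand-in for Magistral or any other VLM

    def ask(image: types.Part, prompt: str) -> str:
        return client.models.generate_content(model=MODEL, contents=[image, prompt]).text

    image = types.Part.from_bytes(
        data=Path("complex-table.png").read_bytes(), mime_type="image/png"
    )

    # Pass 1: structure only, no data yet.
    schema = ask(
        image,
        "Propose a JSON Schema for one row of this table: one property per column, "
        "booleans for tick boxes. Return only the JSON Schema.",
    )

    # Pass 2: extract the data so it conforms to the structure from pass 1.
    rows = ask(
        image,
        "Extract every row of this table as a JSON array of objects conforming to "
        f"this JSON Schema, and return only the JSON:\n{schema}",
    )
    print(rows)  # in practice: strip any code fences, json.loads, validate
    ```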

    • > But that's something else, that's no longer just OCR ("Optical Character Recognition").

      Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

      It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.

  • I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.

Technically not OCR, but HTR (handwritten text recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.

  • This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, hearing they are great at OCR. It's been… terrible. The LLM will transcribe the record without problems, and the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription: even though the transcribed text is far from perfect and many letters and words are recognized incorrectly, it helps me a lot with the more difficult records.

  • We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.

    • I have a feeling that whilst LLM providers could have models say "Sorry", there is little incentive to: it would expose the reality that they are not very accurate, nor can that accuracy be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should defer to the user rather than guess.

  • Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

    I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots, and it has worked really well.
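
    As a sketch, the review pass can be as simple as this (assuming the google-genai SDK mentioned upthread; the prompt wording is just an illustration):

    ```python
    # Second pass: give the reviewer both the image and the draft transcription.
    from pathlib import Path

    from google import genai
    from google.genai import types

    client = genai.Client()

    def review_transcription(image_path: str, draft: str) -> str:
        image = types.Part.from_bytes(
            data=Path(image_path).read_bytes(), mime_type="image/png"
        )
        prompt = (
            "Below is a draft transcription of this image. Compare it word by word "
            "against the image, fix anything that does not match, and replace "
            "anything you cannot verify with [illegible].\n\n" + draft
        )
        return client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[image, prompt],
        ).text
    ```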

If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.

(I'm not being snarky. It's acceptable in some cases.)

  • But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which, in fairness, makes the mistakes much harder to catch.

  • Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
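
    Something along these lines (the exact wording here is just an example):

    ```python
    # The kind of instruction meant above; tweak the marker to taste.
    PROMPT = (
        "Transcribe this page exactly. If you are not reasonably confident about "
        "a word, wrap it in [? ?] instead of guessing, e.g. [?Kowalski?]. "
        "Use [illegible] where you cannot read anything at all."
    )
    ```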

  • These days it does just that: it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).

    Blotchy text and certain typefaces make 6s look like 8s. To the non-discerning eye, a human would think it's an 8; zoom in and you see it's a 6.

    Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.

  • Do any LLM OCRs give bounding boxes anyway? Per character and per block.

No way it's solved. Try running OCR on a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines, and from time to time I try to OCR them with the state-of-the-art mechanisms. All of them require a lot of human intervention.

> My impression is that OCR is basically solved at this point.

Not really, in my experience. In particular, they still struggle with table format detection.

  • This.

    Any table with complex parent/child cell-span relationships still comes out with low accuracy.

    Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus 4.1, or Gemini 2.5 Pro to produce an HTML table.

    They will fail.

  • I had mentioned this when the new Qwen model dropped: I have a stack of construction invoices that fail with both OCR and OpenAI.

    It's a hard (and very interesting) problem space.

Complex documents are where OCR struggles mightily. If you have a simple document with paragraphs of text, sure, OCR is pretty much solved. If you have a complex layout with figures and graphs and supporting images and asides and captions and so on (basically any paper, or even trade documents), it absolutely falls apart.

And general-purpose LLMs are heinous at OCR. If you are having success with Flash Lite, your documents must be incredibly simple.

There have been enormous advances in OCR over the past 6 months, so the state of the art is a moving, rapidly advancing target.

There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well, if your use case is, say, digitizing old printed novels, then yeah, it's probably good enough.

But what if your documents are personal records with millions of names, to insert into some administrative database? With roughly ten characters per name, about 1 in 100 people will have their name misspelled. Oops.
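
To make the compounding explicit (the ten-character name length is just an assumed average):

```python
# Per-character accuracy compounds over the length of a name,
# assuming independent errors and ~10 characters per name.
char_accuracy = 0.999
name_length = 10
p_name_wrong = 1 - char_accuracy ** name_length
print(f"{p_name_wrong:.2%}")  # ~1.00%, i.e. roughly 1 in 100 names
```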

> The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, we just pivoted away from the OCR API as a product! We still use our API internally but have been lazy about updating the benchmarks.

Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors, where it decides the output is too close to its training data and cuts it off; something like 10% of the time in our testing. It also has this hilarious hallucination where, if there's a blank page in the document mix, it just makes up new info.

OpenAI is OK. GPT-5 wasn't any better than 4o or 4.1. The main issues: it drops content like headers/footers, loses its mind on sideways pages, and will frequently refuse to read things like ID documents, health care forms, or anything it judges to have too much PII.

Vision LLMs suck at complex layouts, and there is a high risk of hallucination. Never use them alone for contracts or health data.

I've only used Tesseract 'recreationally', but I tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable; I was shocked at how bad it was. It relies heavily on language models of character sequences and is pretty useless on 'line noise'.
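
Something like this reproduces the experiment (Pillow + pytesseract; the noise model and parameters are my own arbitrary choices):

```python
# Render a random string, degrade it, and see what Tesseract makes of it.
import random
import string

import pytesseract
from PIL import Image, ImageDraw, ImageFont

def random_text_image(n_chars: int = 10, noise: float = 0.02) -> tuple[str, Image.Image]:
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=n_chars))
    img = Image.new("L", (16 * n_chars, 32), color=255)  # white background
    ImageDraw.Draw(img).text((8, 10), text, fill=0, font=ImageFont.load_default())
    # Salt-and-pepper noise to simulate a bad scan.
    pixels = img.load()
    for _ in range(int(noise * img.width * img.height)):
        x, y = random.randrange(img.width), random.randrange(img.height)
        pixels[x, y] = random.choice((0, 255))
    return text, img

truth, img = random_text_image()
guess = pytesseract.image_to_string(img, config="--psm 7").strip()  # psm 7 = single line
print(truth, "->", guess)
```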

That will not work with many of the world's most important documents because of information density: for example, dense tables or tables with lots of row/column spans, complex forms with checkboxes, complex real-world formatting, and features like strikethroughs.

To solve this generally you need to chunk not by page but into semantic chunks that don't exceed the model's information-density threshold for the given task.

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds of pages. Solving that generally is an open problem.
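
For the easy end of that spectrum, the chunking can be as naive as splitting an already row-structured table and repeating the header in every chunk; it's the general case that stays open. A toy sketch (the token budget and chars-per-token heuristic are assumptions):

```python
# Density-based chunking for the easy case: a table already split into header + rows.
# Each chunk repeats the header so the model keeps the column context.
def chunk_table(header: str, rows: list[str], max_tokens: int = 2000) -> list[str]:
    budget = max_tokens * 4  # rough "4 chars per token" heuristic
    chunks, current, size = [], [header], len(header)
    for row in rows:
        if size + len(row) > budget and len(current) > 1:
            chunks.append("\n".join(current))
            current, size = [header], len(header)
        current.append(row)
        size += len(row)
    chunks.append("\n".join(current))
    return chunks
```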

Going from clearly printed text to a sequence of characters is solved, for use cases that don't require 100% accuracy.

But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.

Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).

Those all seem to me like areas where an LLM-based approach could narrow the gap between machine recognition and humans. As a human, you have to sort of reason it out from the context too.

OCR of printed text may be one thing, but handwriting OCR (a.k.a. HTR) is very, very far from solved. It's actually hard to find a practical task that general historical HTR is good enough to do usefully, even for state-of-the-art models.

I agree, Gemini 2.5 models are excellent.

The fuss around old-fashioned OCR seemed strange to me initially, considering the above, but I selfishly forgot to consider compute/offline requirements.

It would also be nice for there to be a good competitor.

Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup? I have a quarter million pages I want to OCR.

I attempted OCR using all of the open-source models available about 3 months ago, including Llama 4. These were PNGs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90% accurate. Using OpenAI or Gemini produced much better results, but the open-source models were really bad.

OCR for printed documents is super robust, but handwriting, low-res input, and aligned recognition (not just image to "hello world" but also knowing where in space each character sits: "h is here, e is here, ...") are all still well behind "basically solved."

I think it'd be good to have an end-to-end PDF-to-LaTeX converter for old math papers. Almost all models still struggle with commutative diagrams, especially very complicated ones.