Comment by pietz

3 months ago

My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown, with some additional formatting instructions. That will cost you around $0.20 per 1000 pages in batch mode, and the results have been great.
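
In case it's useful, a minimal sketch of that setup (assuming the google-genai Python SDK and an API key in the environment; the exact model name may differ, and the $0.20 per 1000 pages figure is with the batch API rather than this synchronous call):

```python
# One page image in, Markdown out.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

PROMPT = (
    "Extract the full content of this page as Markdown. "
    "Preserve headings, lists and tables. Do not add commentary."
)

def page_to_markdown(png_path: str) -> str:
    image = types.Part.from_bytes(
        data=Path(png_path).read_bytes(), mime_type="image/png"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[image, PROMPT],
    )
    return response.text

print(page_to_markdown("page_001.png"))
```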

I'd be interested to hear where OCR still struggles today.

Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on those things.

Curious to hear which OCR tool or LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first writing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist

  • > Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML:

    But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.

    For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
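
    A rough sketch of that two-pass flow, using the Gemini snippet from the parent comment as a stand-in for Magistral (the prompts and file name are just illustrations):

    ```python
    # Two passes: 1) infer the table's structure, 2) extract the data against it.
    from pathlib import Path

    from google import genai
    from google.genai import types

    client = genai.Client()
    MODEL = "gemini-2.5-flash-lite"  # stand-in for Magistral or any other VLM

    def ask(image: types.Part, prompt: str) -> str:
        return client.models.generate_content(model=MODEL, contents=[image, prompt]).text

    image = types.Part.from_bytes(
        data=Path("complex-table.png").read_bytes(), mime_type="image/png"
    )

    # Pass 1: structure only, no data yet.
    schema = ask(
        image,
        "Propose a JSON Schema for one row of this table: one property per column, "
        "booleans for tick boxes. Return only the JSON Schema.",
    )

    # Pass 2: extract the data so it conforms to the structure from pass 1.
    rows = ask(
        image,
        "Extract every row of this table as a JSON array of objects conforming to "
        f"this JSON Schema, and return only the JSON:\n{schema}",
    )
    print(rows)  # in practice: strip any code fences, json.loads, validate
    ```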

    • > But that's something else, that's no longer just OCR ("Optical Character Recognition").

      Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

      It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.

  • I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.

Technically not OCR, but HTR (handwritten text recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.

  • This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, hearing they are great at OCR. It's been… terrible. The LLM will transcribe the record without problems, and the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription: even though the transcribed text is far from perfect and many letters and words are recognized incorrectly, it helps me a lot with the more difficult records.

  • We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.

    • I have a feeling that whilst LLM providers could have models say "Sorry", there is little incentive to: it would expose the reality that they are not very accurate, nor can that accuracy be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should defer to the user rather than guess.

  • Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

    I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots, and it has worked really well.
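
    As a sketch, the review pass can be as simple as this (assuming the google-genai SDK mentioned upthread; the prompt wording is just an illustration):

    ```python
    # Second pass: give the reviewer both the image and the draft transcription.
    from pathlib import Path

    from google import genai
    from google.genai import types

    client = genai.Client()

    def review_transcription(image_path: str, draft: str) -> str:
        image = types.Part.from_bytes(
            data=Path(image_path).read_bytes(), mime_type="image/png"
        )
        prompt = (
            "Below is a draft transcription of this image. Compare it word by word "
            "against the image, fix anything that does not match, and replace "
            "anything you cannot verify with [illegible].\n\n" + draft
        )
        return client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[image, prompt],
        ).text
    ```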

If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.

(I'm not being snarky. It's acceptable in some cases.)

  • But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which, in fairness, makes the mistakes much harder to catch.

  • Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
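
    Something along these lines (the exact wording here is just an example):

    ```python
    # The kind of instruction meant above; tweak the marker to taste.
    PROMPT = (
        "Transcribe this page exactly. If you are not reasonably confident about "
        "a word, wrap it in [? ?] instead of guessing, e.g. [?Kowalski?]. "
        "Use [illegible] where you cannot read anything at all."
    )
    ```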

  • These days it does just that: it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).

    Blotchy text and certain typefaces make 6s look like 8s. To the non-discerning eye, a human would think it's an 8; zoom in and you see it's a 6.

    Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.

  • Do any LLM OCRs give bounding boxes anyway? Per character and per block.

No way it's solved. Try running OCR on a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines, and from time to time I try to OCR them with the state-of-the-art mechanisms. All of them require a lot of human intervention.

> My impression is that OCR is basically solved at this point.

Not really, in my experience. In particular, they still struggle with table format detection.

  • This.

    Any table with complex parent/child cell-span relationships still comes out with low accuracy.

    Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus 4.1, or Gemini 2.5 Pro to produce an HTML table.

    They will fail.

  • I had mentioned this when the new Qwen model dropped: I have a stack of construction invoices that fail with both OCR and OpenAI.

    It's a hard (and very interesting) problem space.

Complex documents are where OCR struggles mightily. If you have a simple document with paragraphs of text, sure, OCR is pretty much solved. If you have a complex layout with figures and graphs and supporting images and asides and captions and so on (basically any paper, or even trade documents), it absolutely falls apart.

And general-purpose LLMs are heinous at OCR. If you are having success with Flash Lite, your documents must be incredibly simple.

There have been enormous advances in OCR over the past 6 months, so the state of the art is a moving, rapidly advancing target.

There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well, if your use case is, say, digitizing old printed novels, then yeah, it's probably good enough.

But what if your documents are personal records with millions of names, to insert into some administrative database? With roughly ten characters per name, about 1 in 100 people will have their name misspelled. Oops.
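
To make the compounding explicit (the ten-character name length is just an assumed average):

```python
# Per-character accuracy compounds over the length of a name,
# assuming independent errors and ~10 characters per name.
char_accuracy = 0.999
name_length = 10
p_name_wrong = 1 - char_accuracy ** name_length
print(f"{p_name_wrong:.2%}")  # ~1.00%, i.e. roughly 1 in 100 names
```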

> The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, we just pivoted away from the OCR API as a product! We still use our API internally but have been lazy about updating the benchmarks.

Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors, where it decides the output is too close to its training data and cuts it off; something like 10% of the time in our testing. It also has this hilarious hallucination where, if there's a blank page in the document mix, it just makes up new info.

OpenAI is OK. GPT-5 wasn't any better than 4o or 4.1. The main issues: it drops content like headers/footers, loses its mind on sideways pages, and will frequently refuse to read things like ID documents, health care forms, or anything it judges to have too much PII.

Vision LLMs suck at complex layouts, and there is a high risk of hallucination. Never use them alone for contracts or health data.

I've only used Tesseract 'recreationally', but I tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable; I was shocked at how bad it was. It relies heavily on language models of character sequences and is pretty useless on 'line noise'.
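
Something like this reproduces the experiment (Pillow + pytesseract; the noise model and parameters are my own arbitrary choices):

```python
# Render a random string, degrade it, and see what Tesseract makes of it.
import random
import string

import pytesseract
from PIL import Image, ImageDraw, ImageFont

def random_text_image(n_chars: int = 10, noise: float = 0.02) -> tuple[str, Image.Image]:
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=n_chars))
    img = Image.new("L", (16 * n_chars, 32), color=255)  # white background
    ImageDraw.Draw(img).text((8, 10), text, fill=0, font=ImageFont.load_default())
    # Salt-and-pepper noise to simulate a bad scan.
    pixels = img.load()
    for _ in range(int(noise * img.width * img.height)):
        x, y = random.randrange(img.width), random.randrange(img.height)
        pixels[x, y] = random.choice((0, 255))
    return text, img

truth, img = random_text_image()
guess = pytesseract.image_to_string(img, config="--psm 7").strip()  # psm 7 = single line
print(truth, "->", guess)
```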

That will not work with many of the world's most important documents because of information density: for example, dense tables or tables with lots of row/column spans, complex forms with checkboxes, complex real-world formatting, and features like strikethroughs.

To solve this generally you need to chunk not by page but into semantic chunks that don't exceed the model's information-density threshold for the given task.

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds of pages. Solving that generally is an open problem.
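
For the easy end of that spectrum, the chunking can be as naive as splitting an already row-structured table and repeating the header in every chunk; it's the general case that stays open. A toy sketch (the token budget and chars-per-token heuristic are assumptions):

```python
# Density-based chunking for the easy case: a table already split into header + rows.
# Each chunk repeats the header so the model keeps the column context.
def chunk_table(header: str, rows: list[str], max_tokens: int = 2000) -> list[str]:
    budget = max_tokens * 4  # rough "4 chars per token" heuristic
    chunks, current, size = [], [header], len(header)
    for row in rows:
        if size + len(row) > budget and len(current) > 1:
            chunks.append("\n".join(current))
            current, size = [header], len(header)
        current.append(row)
        size += len(row)
    chunks.append("\n".join(current))
    return chunks
```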

Going from clearly printed text to a sequence of characters is solved, for use cases that don't require 100% accuracy.

But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.

Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).

Those all seem to me like areas where an LLM-based approach could narrow the gap between machine recognition and humans. As a human, you have to sort of reason it out from the context too.

OCR of printed text may be one thing, but handwriting OCR (a.k.a. HTR) is very, very far from solved. It's actually hard to find a practical task that general historical HTR is good enough to do usefully, even for state-of-the-art models.

I agree, Gemini 2.5 models are excellent.

The fuss around old-fashioned OCR seemed strange to me initially, considering the above, but I selfishly forgot to consider compute/offline requirements.

It would also be nice for there to be a good competitor.

Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup? I have a quarter million pages I want to OCR.

I attempted OCR using all of the open-source models available about 3 months ago, including Llama 4. These were PNGs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90% accurate. Using OpenAI or Gemini produced much better results, but the open-source models were really bad.

OCR for printed documents is super robust, but handwriting, low-res input, and aligned recognition (not just image to "hello world" but also knowing where in space each character sits: "h is here, e is here, ...") are all still well behind "basically solved."

I think it'd be good to have an end-to-end PDF-to-LaTeX converter for old math papers. Almost all models still struggle with commutative diagrams, especially very complicated ones.