Comment by kbyatnal
17 days ago
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.
I founded a doc processing company [1] and in our experience, a lot of the difficulty with deploying document processing into production comes when accuracy requirements are high (> 97%). OCR and parsing are only one part of the problem; real-world use cases need to bridge the gap between raw model outputs and production-ready data.
This requires things like:
- state-of-the-art parsing powered by VLMs and OCR
- multi-step extraction powered by semantic chunking, bounding boxes, and citations (see the sketch after this list)
- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)
- tooling that lets non-technical team members quickly iterate, review results, and improve accuracy
- evaluation and benchmarking tools
- fine-tuning pipelines that turn reviewed corrections into custom models
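To make the bounding-box-and-citation point concrete, here's a rough sketch of the output shape such a pipeline might produce (plain Python with made-up names, not our actual schema): every extracted field carries a confidence score and a citation back to a region of the source page, which is what makes fast human review possible.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BoundingBox:
    page: int      # zero-based page index
    x0: float      # left edge, in page coordinates
    y0: float      # top edge
    x1: float      # right edge
    y1: float      # bottom edge

@dataclass
class ExtractedField:
    name: str                               # e.g. "invoice_total"
    value: str                              # the extracted text
    confidence: float                       # model confidence in [0, 1]
    citation: Optional[BoundingBox] = None  # where in the source it came from

@dataclass
class ExtractionResult:
    doc_type: str                          # classification output
    fields: List[ExtractedField] = field(default_factory=list)

    def needs_review(self, threshold: float = 0.97) -> bool:
        # Route anything below the accuracy bar to a human review queue.
        return any(f.confidence < threshold for f in self.fields)
```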
Excited to test and benchmark Gemini 2.0 in our product; the progress here is impressive.
> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.
I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
Seems like you could mitigate hallucinations by repeating the task multiple times: non-hallucinated output will come back the same, while hallucinations will differ between runs, so you can discard and retry the hallucinated sections. This increases cost by a fixed multiple, but if the cost of tokens continues to fall, that's probably perfectly fine.
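A minimal sketch of that repeat-and-compare idea, assuming a `transcribe` callable that wraps whatever multimodal model you're using (it's a placeholder, not a real API):

```python
from collections import Counter

def transcribe_with_consensus(image, transcribe, runs: int = 3, max_retries: int = 5) -> str:
    """Repeat the transcription and keep the majority answer.

    Faithful transcriptions tend to repeat across runs, while
    hallucinations vary, so agreement is a cheap consistency check.
    """
    for _ in range(max_retries):
        outputs = [transcribe(image) for _ in range(runs)]
        answer, votes = Counter(outputs).most_common(1)[0]
        if votes > runs // 2:  # a strict majority agrees
            return answer
    raise RuntimeError("no stable consensus; flag for human review")
```

In practice you'd probably compare section by section, and with fuzzy rather than exact matching, since harmless formatting differences between runs would otherwise look like disagreement.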
As someone points out above, you can use a second and even a third LLM to correct the first LLM's output; I think that's the way to minimize hallucinations.
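Roughly, that loop might look like this (both model wrappers are hypothetical placeholders, not any particular API):

```python
from dataclasses import dataclass

@dataclass
class Review:
    ok: bool        # did the reviewer accept the draft?
    corrected: str  # the reviewer's corrected text (== draft when ok)

def extract_with_review(image, draft_model, review_model, max_rounds: int = 2) -> str:
    draft = draft_model(image)                        # first model produces a draft
    for _ in range(max_rounds):
        verdict: Review = review_model(image, draft)  # second model checks it against the source
        if verdict.ok:
            return draft
        draft = verdict.corrected                     # take the correction and re-check it
    return draft  # out of rounds: return best effort, or flag for human review
```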
> I think it is the way to minimize hallucinations
Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.
I think professional services will continue to use OCR in one way or another, because it's simply too cheap, fast, and accurate to give up. Perhaps multimodal models can help address OCR's shortcomings, like layout detection and guessing unrecognizable characters.
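One way that hybrid could look, sketched below: trust the cheap, fast OCR pass by default and only escalate low-confidence words to a multimodal model. The OCR pass uses the real pytesseract API; `ask_vlm` is a hypothetical wrapper around whatever multimodal model you prefer.

```python
import pytesseract
from pytesseract import Output
from PIL import Image

def hybrid_ocr(image: Image.Image, ask_vlm, conf_threshold: float = 60.0) -> str:
    # Per-word text, confidence, and bounding boxes from Tesseract.
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words = []
    for text, conf, x, y, w, h in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        if not text.strip():
            continue
        if float(conf) >= conf_threshold:
            words.append(text)           # OCR is confident: keep it
        else:
            crop = image.crop((x, y, x + w, y + h))
            words.append(ask_vlm(crop))  # escalate just this region
    return " ".join(words)
```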