Comment by anon373839

17 days ago

> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.

I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.

Seems like you could solve hallucinations by repeating the task multiple times. Non-hallucinated text will be the same across runs; hallucinations will differ. Discard and retry the sections that disagree. This increases cost by a fixed multiple, but if the cost of tokens continues to fall, that's probably fine.
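A minimal sketch of that repeat-and-compare idea (the `run_extraction` stub is a hypothetical stand-in for whatever multimodal OCR call you'd actually use, and the naive line-by-line `zip` assumes all runs segment the page identically; real output would need alignment, e.g. with `difflib`):

```python
from collections import Counter

def run_extraction(page_image: bytes) -> list[str]:
    """Hypothetical: returns the page text as a list of line strings."""
    raise NotImplementedError

def extract_with_voting(page_image: bytes, n_runs: int = 3):
    # Run the same extraction several times on the same page.
    runs = [run_extraction(page_image) for _ in range(n_runs)]

    agreed, suspect = [], []
    # Compare line-by-line; assumes runs produce the same number of lines.
    for lines in zip(*runs):
        text, votes = Counter(lines).most_common(1)[0]
        if votes == n_runs:
            agreed.append(text)      # all runs identical -> likely faithful
        else:
            suspect.append(lines)    # any disagreement -> possible hallucination, retry
    return agreed, suspect
```

The cost multiple is just `n_runs`; the open question is whether independent runs actually make independent mistakes, which is what the approach relies on.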

As mentioned above, someone is using a second and even a third LLM to correct LLM outputs; I think it is the way to minimize hallucinations.
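Roughly, the second-model pass looks like this (a sketch only, not anyone's actual pipeline: `call_model`, the model name, and the prompt wording are all assumptions):

```python
def call_model(model: str, prompt: str, image: bytes) -> str:
    """Hypothetical wrapper around whatever multimodal API is in use."""
    raise NotImplementedError

def verify_extraction(page_image: bytes, first_pass: str) -> str:
    # The reviewer model re-reads the page and only verifies/corrects the
    # first pass rather than extracting from scratch.
    prompt = (
        "Here is text extracted from the attached page. Compare it to the "
        "image and return a corrected version, marking any span you cannot "
        "verify with [UNVERIFIED].\n\n" + first_pass
    )
    # A third pass could repeat this with a different model for cross-checking.
    return call_model("reviewer-model", prompt, page_image)
```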

  • > I think it is the way to minimize hallucinations

    Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.