Comment by anon373839

17 days ago

> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.

I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.

Seems like you could solve hallucinations by repeating the task multiple times. Non-hallucinated text will be the same across runs; hallucinations will differ. Discard and retry the sections that disagree. This increases cost by a fixed multiple, but if the cost of tokens continues to fall, that's probably fine.
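A minimal sketch of that repeat-and-compare idea (the `run_extraction` stub is a hypothetical stand-in for whatever multimodal OCR call you'd actually use, and the naive line-by-line `zip` assumes all runs segment the page identically; real output would need alignment, e.g. with `difflib`):

```python
from collections import Counter

def run_extraction(page_image: bytes) -> list[str]:
    """Hypothetical: returns the page text as a list of line strings."""
    raise NotImplementedError

def extract_with_voting(page_image: bytes, n_runs: int = 3):
    # Run the same extraction several times on the same page.
    runs = [run_extraction(page_image) for _ in range(n_runs)]

    agreed, suspect = [], []
    # Compare line-by-line; assumes runs produce the same number of lines.
    for lines in zip(*runs):
        text, votes = Counter(lines).most_common(1)[0]
        if votes == n_runs:
            agreed.append(text)      # all runs identical -> likely faithful
        else:
            suspect.append(lines)    # any disagreement -> possible hallucination, retry
    return agreed, suspect
```

The cost multiple is just `n_runs`; the open question is whether independent runs actually make independent mistakes, which is what the approach relies on.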

As mentioned above, someone is using a second and even a third LLM to correct LLM outputs; I think it is the way to minimize hallucinations.
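Roughly, the second-model pass looks like this (a sketch only, not anyone's actual pipeline: `call_model`, the model name, and the prompt wording are all assumptions):

```python
def call_model(model: str, prompt: str, image: bytes) -> str:
    """Hypothetical wrapper around whatever multimodal API is in use."""
    raise NotImplementedError

def verify_extraction(page_image: bytes, first_pass: str) -> str:
    # The reviewer model re-reads the page and only verifies/corrects the
    # first pass rather than extracting from scratch.
    prompt = (
        "Here is text extracted from the attached page. Compare it to the "
        "image and return a corrected version, marking any span you cannot "
        "verify with [UNVERIFIED].\n\n" + first_pass
    )
    # A third pass could repeat this with a different model for cross-checking.
    return call_model("reviewer-model", prompt, page_image)
```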

  • > I think it is the way to minimize hallucinations

    Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.