Comment by applgo443
16 days ago
Why are traditional OCRs better in terms of hallucination and confidence scores?
Can we use logprobs of LLM as confidence scores?
16 days ago
> Why are traditional OCRs better in terms of hallucination and confidence scores?
> Can we use logprobs of LLM as confidence scores?
Traditional OCRs are trained for a single task: recognize characters. They do this through visual features (and sometimes there's an implicit, or even explicit, "language" model: see https://arxiv.org/abs/1805.09441). As such, the extent of their "hallucination", or errors, is limited to ambiguous characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the OCR model, you usually do text detection (get bounding boxes) followed by text recognition (read the characters), so everything stays fairly local: you're only ever dealing with a small crop.
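To make the "local, per-word confidence" point concrete, here's a minimal sketch using Tesseract through pytesseract (my assumption, not something from the original question; the image path is a placeholder, and other detector/recognizer pipelines like EasyOCR or PaddleOCR expose similar per-box scores):

```python
# Minimal sketch: per-word confidences from a traditional OCR engine.
# Assumes Tesseract and pytesseract are installed; "receipt.png" is a
# placeholder input image.
import pytesseract
from PIL import Image

img = Image.open("receipt.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

# Each recognized word comes with its own bounding box and confidence,
# so the score only reflects a small local crop of the page.
for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)
    if word.strip() and conf >= 0:  # conf is -1 for non-word boxes
        print(f"{word!r}: {conf:.0f}%")
```

Because each score is tied to a single detected box, a low value points you at exactly which word to re-check.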
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also on a dozen other tasks. As such, they're really good OCR models, but their logprobs tend not to be as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you move to longer documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
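And on the logprobs question: yes, you can pull per-token logprobs out of a transformers generate() call and aggregate them into a rough confidence score. A hedged sketch below, assuming the Qwen/Qwen2-VL-7B-Instruct checkpoint and a hypothetical invoice.png; these are exactly the numbers I'd expect to be less well calibrated than a dedicated OCR model's:

```python
# Hedged sketch: per-token logprobs from a VLM as a rough confidence signal.
# Model id, prompt, and image path are illustrative; the same pattern works
# for any transformers generate() call that returns scores.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Transcribe all text in this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[prompt], images=[Image.open("invoice.png")], return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=512,
        return_dict_in_generate=True, output_scores=True,
    )

# Log-probability of each generated token given its prefix.
logprobs = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)[0]
token_conf = logprobs.exp()                         # per-token "confidence"
print("mean token prob:", token_conf.mean().item())
print("min token prob:", token_conf.min().item())   # flags the shakiest token
```

Aggregating like this is crude (a mean can hide a single low-confidence digit), and, as above, there's no guarantee these probabilities are calibrated the way a dedicated recognizer's scores are.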