
Comment by itissid

17 days ago

Wait, isn't there at least a two-step process here: one is semantic segmentation, followed by a method like Textract for the text, to avoid hallucinations?

One cannot possibly claim that "text extracted by a multimodal model cannot hallucinate", right?

> accuracy was like 96% of that of the vendor and price was significantly cheaper.

I would like to know how this 96% was tested. If you use a human to do random-sample-based testing, how do you adjust the sample for variation in how the errors are distributed? A small set of documents could contain 90% of the errors and yet make up only 1% of the docs.
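
For a concrete illustration of the sampling worry, here is a quick simulation (all numbers made up) where 1% of documents carry most of the errors; most small random audits never see a hard document at all and so overestimate accuracy:

```python
import random

# Made-up corpus to illustrate the concern: 1% of the documents are "hard" and
# carry almost all of the errors; the rest are nearly clean.
random.seed(0)
docs = ["hard"] * 100 + ["easy"] * 9900
error_rate = {"hard": 0.50, "easy": 0.005}   # assumed per-document error rates

def accuracy(sample):
    # Accuracy a reviewer would report after checking every document in the sample.
    return 1 - sum(error_rate[d] for d in sample) / len(sample)

true_acc = accuracy(docs)

# Simulate many small human audits, each reviewing 50 randomly chosen documents.
samples = [random.sample(docs, 50) for _ in range(1000)]
audits = [accuracy(s) for s in samples]

print(f"true accuracy:           {true_acc:.3f}")
print(f"audit estimates:         min={min(audits):.3f}, max={max(audits):.3f}")
print(f"audits with no hard doc: {sum('hard' not in s for s in samples) / 10:.0f}%")
```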

One thing people always forget about traditional OCR providers (Azure, Tesseract, AWS Textract, etc.) is that they're ~85% accurate.

They are all probabilistic. You literally get back characters plus confidence scores. So when Textract gives you back incorrect characters, is that a hallucination?
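
For example, a minimal sketch (boto3; the input file and the 90% threshold are arbitrary choices) of pulling those per-word confidences out of Textract and flagging the shaky words for review:

```python
import boto3

textract = boto3.client("textract")

# Hypothetical scanned page; Textract returns WORD blocks with 0-100 confidence scores.
with open("invoice.png", "rb") as f:
    resp = textract.detect_document_text(Document={"Bytes": f.read()})

for block in resp["Blocks"]:
    if block["BlockType"] == "WORD" and block["Confidence"] < 90:
        print(f'{block["Confidence"]:5.1f}%  {block["Text"]}')  # flag for human review
```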

  • I'm the founder of https://doctly.ai, which also does PDF extraction.

    The hallucination in LLM extraction is much more subtle, as it will sometimes rewrite full sentences. It is much harder to spot when reading the document because it sounds very plausible.

    We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence. That way you have the option of trading compute and cost for accuracy.
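
    Roughly the idea, as a sketch (the `extract_markdown` helper and the model names are placeholders, and a real agreement check would be fuzzier than plain string equality):

    ```python
    from typing import Callable

    # `extract_markdown(model, pdf_bytes)` stands in for whatever client call sends
    # the document to the named vision model and returns its transcription.
    ExtractFn = Callable[[str, bytes], str]

    def normalize(text: str) -> str:
        # Crude normalization so whitespace differences alone don't count as disagreement.
        return " ".join(text.split()).lower()

    def extract_with_consensus(extract_markdown: ExtractFn, pdf_bytes: bytes) -> tuple[str, str]:
        a = extract_markdown("model-a", pdf_bytes)
        b = extract_markdown("model-b", pdf_bytes)
        if normalize(a) == normalize(b):
            return a, "high confidence: two models agree"

        # Disagreement: spend extra compute on a third model to break the tie.
        c = extract_markdown("model-c", pdf_bytes)
        for candidate in (a, b):
            if normalize(candidate) == normalize(c):
                return candidate, "medium confidence: 2-of-3 agreement"
        return a, "low confidence: no agreement, route to human review"
    ```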

    • > We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence.

      I'm interested to hear more about the validation process here. In my limited experience, I've sent the same "document" to multiple LLMs and gotten differing results, and sometimes the "right" answer was in the minority of responses. Over a large sample (same general intent of document, but very different possible formats of the information within), there was no definitive winner. We're still working on this.

    • What if you use a different prompt to check the result? Did that work? I was thinking of using this approach, but now I think it may be better to use two different LLMs, like you do.

  • It's a question of scale. When a traditional OCR system makes an error, it's confined to a relatively small part of the overall text. (Think of "Plastics" becoming "PIastics".) When an LLM hallucinates, there is no limit to how much text can be made up. Entire sentences can be rewritten because the model thinks they're more plausible than the sentences that were actually printed. And because the bias is always toward plausibility, it's an especially insidious problem.

    • It's a bit of a pick-your-poison situation. You're right that traditional OCR mistakes are usually easy to catch (except when you get $30.28 vs $80.23), compared to LLM hallucinations, which always look plausibly correct.

      But on the flip side, layout is often the biggest determinant of accuracy, and that's something LLMs do a much better job on. It doesn't matter if you have 100% accurate text from a table if all of that text is balled into one big paragraph.

      Also, the "pick the most plausible" approach is a blessing and a curse. A good example is the handwritten form here [1]. GPT-4o gets all the email addresses correct because it can reasonably guess these people are all from the same company, whereas AWS treats them all independently and returns three different emails.

      [1] https://getomni.ai/ocr-demo

  • The difference is the kind of hallucinations you get.

    Traditional OCR is more likely to skip characters, or replace them with similar-looking ones, so you often get AL or A1 instead of AI, for example. In other words, traditional spelling mistakes. LLMs can do anything from hallucinating new paragraphs to slightly changing the meaning of a sentence. The text is still grammatically correct and makes sense in context, except that it's not what the document actually said.

    I once gave an LLM a handwritten list of words and their definitions and asked it to turn that into flashcards (a JSON array with "word" and "definition"). Traditional OCR struggled with this text; the results were extremely low quality and badly formatted, but still somewhat understandable. The few LLMs I've tried either straight up refused to do it, or gave me the correct list of words but entirely hallucinated the definitions.

  • > You literally get back characters + confidence intervals.

    Oh god, I wish speech-to-text engines would colour-code the whole thing like a heat map, to focus your review on the places where it may have over-enthusiastically guessed at what was said.

    You no knot.
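
    Something like that is easy to hack together once you have per-word confidences. A tiny sketch (ANSI terminal colours, made-up word/confidence pairs) that tints the over-enthusiastic guesses:

    ```python
    # "Heat map" a transcript by per-word confidence using ANSI colours.
    # The (word, confidence) pairs are made up; real ones come from the STT/OCR engine.
    words = [("You", 0.98), ("no", 0.41), ("knot", 0.37)]

    RED, YELLOW, GREEN, RESET = "\033[31m", "\033[33m", "\033[32m", "\033[0m"

    def tint(word: str, conf: float) -> str:
        colour = GREEN if conf >= 0.9 else YELLOW if conf >= 0.7 else RED
        return f"{colour}{word}{RESET}"

    # Low-confidence guesses jump out in red for review.
    print(" ".join(tint(w, c) for w, c in words))
    ```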

    • We did this for a speech-to-text solution in healthcare. Doctors would always manually review everything that was transcribed (you don't want hallucinations in your prescription), and with a heatmap it was trivial to identify, e.g., drugs that were pretty much always misunderstood by the STT.

  • I know nothing about OCR providers. It seems like OCR failure would result in gibberish or awkward wording that might be easy to spot. Doesn't the LLM failure mode assert made-up truths eloquently, in a way that is more difficult to spot?

  • > is that they're ~85% accurate.

    Speaking from experience, you need to double-check "I" vs "l" vs "1", and "0" vs "O", all the time; accuracy seems to depend on the font and some other factors.

    I have a util script I use locally to copy some token values out of screenshots from a VMware client (long story), and I have to manually adjust them 9/10 times.

    How relevant that is or isn't depends on the use case.

For an OCR company I imagine it is unconscionable to do this, because if you were doing, say, OCR for an oral history project for a library and you made hallucination errors, well, you've replaced facts with fiction. Rewriting history? What the actual F.

  • Probably totally fine for a "fintech" (crypto?) though. Most of them are just burning VC money anyway. Maybe a lucky customer gets a windfall because Gemini added some zeros.

    • I think you can just ask DeepSeek to create a coin for you at this point, and with the recent elimination of any oversight, you can automate your rug pulls...

  • Normal OCR (like Tesseract) can be wrong as well (and IMO this happens frequently). It won't hallucinate or straight-up make shit up like an LLM, but a human needs to review OCR results if the workload requires accuracy. Even across multiple runs of the same image, an OCR engine can give different results (in some scenarios). No OCR system is perfectly accurate; they all use some kind of machine-learning/floating-point/potentially nondeterministic tech.

Can confirm: using Gemini, some figure numbers were hallucinated. I had to cross-check each row to make sure the extracted data was correct.

  • Use different models to extract the page and cross-check them against each other. That generally reduces issues a lot.

Wouldn't the temperature on something like OCR be set very low? You want the same result every time. Isn't some part of hallucination due to the randomness introduced by temperature?
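
For reference, a minimal sketch (OpenAI-style chat API; the model name, prompt, and file are placeholders) of pinning the temperature for an extraction call:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

with open("page.png", "rb") as f:  # placeholder scanned page
    page_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    temperature=0,    # greedy-ish decoding: ask for the same output every time
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this page exactly as written."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```

Even at temperature 0 the model still has to guess when the glyphs are ambiguous, so this mostly makes mistakes repeatable rather than eliminating them.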

  • I can imagine that reducing the temperature too much will lead to garbage results in situations where glyphs are unreadable.

    • Isn't that a good thing in this case? This is fintech, so if in doubt, get a human to look at it.

    • So every time you scan something illegible, you want it to return a different result?

The LLMs are near perfect (maybe parsing I instead of 1). If you're using the outputs in the context of RAG, your errors are likely much, much higher in the other parts of your system. Spending a ton of time and money chasing 9's when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).