Comment by bigzyg33k

13 hours ago

RAG certainly doesn't reduce hallucinations to 0, but using RAG correctly in this instance would have solved the hallucinations they describe.

The system described in this post exists to work around OCR inaccuracies. It's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts; just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.

The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.

Is RAG the right tool for this? My understanding was that RAG uses vector similarity to compare queries (the extracted string) against the search corpus (the PDF file). The use case you describe is verification, which sounds like it would be better done with an exhaustive search via string comparison instead of vector similarity.

I could be totally wrong here.

  • Some people define RAG as having to use vector search; others (myself included) define RAG as any technique that retrieves additional relevant context to help generate the response, which can include triggering things like full-text search queries or even grep (increasingly common thanks to Claude Code et al).

  • RAG is just "Retrieval Augmented Generation"; vector similarity is one way to do that retrieval, but not the only one. You are right, though: there is really no retrieval step augmenting the generation here, more like a validation step stuck on the end.

    Though I imagine scenarios where the PDF is just an image (e.g. a scan of a form), and thus the validation would not work.
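
For what it's worth, the "validation step stuck on the end" discussed above could be sketched roughly like this. It is only an illustration under assumptions, not what the post's authors built: it assumes the PDF has a text layer that has already been dumped to a string (so it would not help with the scanned-image case), and the function names `normalize`, `split_fragments`, and `unverified_fragments` are made up for the example. It checks sub-sentence fragments rather than whole sentences because the raw PDF text can have the wrong paragraph/sentence order.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_fragments(text: str) -> list[str]:
    """Split text into sub-sentence fragments.

    Splits on punctuation that is followed by whitespace or the end of the
    string, so decimals such as "3.5" stay in one piece.
    """
    parts = re.split(r"[.!?;:,]+(?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]


def unverified_fragments(llm_output: str, pdf_text: str) -> list[str]:
    """Return the fragments of llm_output that never appear in pdf_text.

    This is an exhaustive substring check (no vectors, no retrieval):
    a non-empty result means the LLM produced text the PDF does not contain.
    """
    haystack = normalize(pdf_text)
    return [
        fragment
        for fragment in split_fragments(llm_output)
        if normalize(fragment) not in haystack
    ]


if __name__ == "__main__":
    # Hypothetical data: raw PDF text in the wrong order, and an LLM
    # transcription that got one number wrong.
    pdf_text = "Total due:  $120\nInvoice number 4471\nPayment terms net 30"
    llm_output = "Invoice number 4471, payment terms net 30. Total due: $150."
    print(unverified_fragments(llm_output, pdf_text))
```

Exact matching like this is deliberately strict: hyphenation across line breaks or ligatures in the PDF text layer would cause false alarms, so a real pipeline would probably need more normalization or a fuzzy match with a threshold.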