
Comment by beernet

11 hours ago

Nothing new to see here. If you are still surprised by model hallucinations in 2025, it might be time for you to catch up or jump on the next hype bandwagon. Also, they reacted well:

> Once confirmed, we corrected the extracted grade immediately.

> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.

I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

It's true, but I think people have the misunderstanding that if you add search / RAG to ground the LLM, the LLM won't hallucinate. In reality the LLM can still hallucinate, just convincingly, in the language of whatever PDF it retrieved.

  • RAG certainly doesn't reduce hallucinations to 0, but using RAG correctly in this instance would have solved the hallucinations they describe.

    The errors described in this post are OCR inaccuracies - it's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts; just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.

    The way they *should* have used RAG is to ensure that the subsentence strings the LLM extracted actually appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR (a rough sketch of that kind of check is below, after this sub-thread).

    • Is RAG the right tool for this? My understanding was that RAG compares queries (the extracted string) against the search corpus (the PDF file) using vector similarity. The use case you describe is verification, which sounds like it would be better done with an exhaustive search via string comparison instead of vector similarity.

      I could be totally wrong here.

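    To make "automated validation" concrete, here is a minimal sketch of the kind of check described above. It assumes the pypdf library for pulling the raw text layer out of the PDF; the file name, the grade strings, and the fuzzy-match threshold are hypothetical placeholders, not anything from the article.

      # Hypothetical sketch: verify that each string the LLM "extracted"
      # actually occurs in the PDF's own text layer before trusting it.
      # Exact substring match first, then a fuzzy fallback (difflib) to
      # tolerate minor whitespace/OCR differences.
      import re
      from difflib import SequenceMatcher
      from pypdf import PdfReader  # assumed dependency for raw text extraction

      def normalize(text: str) -> str:
          # Collapse whitespace and lowercase so layout differences don't matter.
          return re.sub(r"\s+", " ", text).strip().lower()

      def pdf_text(path: str) -> str:
          # Concatenate the text layer of every page into one normalized string.
          reader = PdfReader(path)
          return normalize(" ".join(page.extract_text() or "" for page in reader.pages))

      def appears_in_pdf(extracted: str, corpus: str, threshold: float = 0.9) -> bool:
          needle = normalize(extracted)
          if needle in corpus:  # exhaustive exact substring check
              return True
          # Fuzzy fallback: compare against sliding windows of the same length.
          window = len(needle)
          step = max(1, window // 2)
          best = max(
              (SequenceMatcher(None, needle, corpus[i:i + window]).ratio()
               for i in range(0, max(1, len(corpus) - window + 1), step)),
              default=0.0,
          )
          return best >= threshold

      corpus = pdf_text("transcript.pdf")  # hypothetical input file
      for grade_string in ["Mathematics: A-", "Physics: B+"]:  # hypothetical LLM output
          if not appears_in_pdf(grade_string, corpus):
              print(f"Flag for human review: {grade_string!r} not found in source PDF")

    Note that the check itself is plain exhaustive string comparison, not vector similarity, which is the distinction raised in the reply above: retrieval and verification are different jobs, and verification here only needs string matching.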

What's new or pertinent here is the specific real world use case and who it's impacting.

>It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

Again, I would say that's why context is significant. You are strictly right, but in this instance the model was applied for the purpose of faithfully representing grades. So I wouldn't say it's necessarily a matter of misunderstanding the design - the errors are real, after all - but the fact that it was entrusted with faithful factual representation is what makes this an important story.

Hallucinations are also completely normal, "by design": they are just the output / experience of the system that produces them. It's us who decided on the classification of what's real and what isn't, and looking at the state of things, we are not even very good at agreeing on where the line is.

I know this sounds pedantic, but I think that the phenomenon itself is very human, so it's fascinating that we created something artificial that is a little bit like another human, and here it goes, producing similar issues. Next thing you know it will have emotions, and judgment.

Never thought about it from that perspective, but I think you're right. It is by design, not deceptive intent, just the infinite monkey theorem where we've replaced randomness with pattern matching trained on massive datasets.

  • Another way to look at it is that everything an LLM creates is a 'hallucination'; some of these 'hallucinations' are more useful than others.

    I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.

    This isn't to say the outputs aren't useful; we see that they can be very useful...when used well.

  • The way I've been putting it for a while is, "all they do is hallucinate—it's the only thing they do. Sometimes the hallucinations are useful."

  • The key idea is the model doesn't have any signal on "factual information." It has a huge corpus of training data and the assumption that humans generally don't lie to each other when creating such a corpus.

    ... but (a) we do, and (b) there are all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).

> Nothing new to see here.

Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.

This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was an o3 problem, our current models don't do that".

  • I completely agree with you. GP’s cynical take is an upvote magnet but doesn’t contribute to the discourse.

> I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

While I do see the issue with the word hallucination humanizing the models, I have yet to come up with, or see, a word that so well explains the problem to non-technical people. And quite frankly those are the people that need to understand that this problem still very much exists and is likely never going away.

Technically, yeah, the model is doing exactly what it is supposed to do, and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work and just confusing them more.

  • > And quite frankly those are the people that need to understand that this problem still very much exists and is likely never going away.

    Calling it a hallucination leads people to think that they just need to stop it from hallucinating.

    In layman's terms, it'd be better to understand that LLMs are schizophrenic. Even though that's not really accurate either.

    A better way to get it across is that the models really only understand reality through what they've read about it, and then we ask them for answers "in their own words" - but that's a lot longer than "hallucination".

    It's like the gag in The 40-Year-Old Virgin where he describes breasts as feeling like bags of sand.

I don’t understand the issue with the word “hallucination”.

If a model hallucinates it did do something wrong, something that we would ideally like to minimize.

The fact that it’s impossible to completely get rid of hallucinations is separate.

An electric car uses electricity; it's a fundamental part of its design. But we'd still like to minimize electricity usage.

I also hate the term "hallucination", but for a different reason. A hallucination is a confusion of internal stimulus with external input. The models simply make errors, have bad memory, are overconfident, sample from a fantasy world, or straight up lie, often at rates that are not dissimilar from humans'. For models to truly hallucinate, develop delusions, and all that good schizophrenia stuff, we would need a truly recurrent structure that has enough time to go through something similar to the prodrome and build up distortions and ideas.

TL;DR: being wrong, even very wrong != hallucination

> I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

Can you hear yourself? You are providing excuses for a computer system that produces erroneous output.

  • No he does not.

    He is not saying it's OK for this system to provide wrong answers; he is saying it's normal for information from an LLM to not be reliable, and thus the issue is not coming from the LLM but from the way it is being used.

  • We are in the late stage of the hype cycle for LLMs, where the comments are becoming progressively more ridiculous, like they did for cryptocoins before the market crashed. The other day a user posted that LLMs are the new transistors or electricity.