GPT-5o-mini hallucinates medical residency applicant grades

6 hours ago (thalamusgme.com)

Hi HN, submitting from a burner since I'm an applicant in the current medical residency admissions cycle. I thought it was interesting to show the real-world implications of using LLMs to extract information from PDFs. For context, Thalamus is a company that handles the "backend" for residency programs and all the applications they receive (including handling who to invite for interviews, etc.). One of the more important factors in deciding applicant competitiveness is their medical school performance (their grades), but that information is buried in PDFs sent by schools (often not standardized). So this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...). Some programs have noticed a discrepancy between extracted and reported grades (often in the direction of hallucinating "fails") and brought it to the attention of Thalamus. Unfortunately, it doesn't look like Thalamus is discontinuing usage of the tool.

Regardless, given that there have been a number of posts looking into usage of LLMs for numerical extraction, I thought this story would be a useful cautionary tale.

EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist

  • It's amazing how much of "inter-organization information flow" still happens over PDFs and/or just FTP'ing files around.

    A couple jobs ago at a hedge fund, I owned the system that would take financial data from counterparties, process it, and send it to internal teams for reconciliation etc.

    The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.

    As others have pointed out: yes, the overall thrust of the industry is to get to something standardized but 100% adoption will probably never happen.

    I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657

    • > the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.

      I'm interested in what the conditions were that didn't let you reject those kinds of transactions, or blacklist them for the future.

      We hear about companies firing/banning unprofitable customers sometimes, surprised it doesn't happen more often honestly.

  • Thank you for sharing this.

    It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.

    • I imagine they would love to create a simple API for this, but the problem is convincing thousands of schools to use that API.

      If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid these kinds of show-stopper problems.

      2 replies →

    • The trouble is getting people to use your API - in this case med schools, but it can be much, much worse (more and smaller organizations sending you data, and in some industries you have a legal obligation to accept it in any format they care to send).

  • Why don't they just email a form after/when you apply and have you fill in all the grades in a structured way? How many grades are we talking about here? Then the PDF would just be the proof that your grades were real.

  • How should medical residency work? Like how should admissions work, is the match doing what you would want it to do, is there a radical alternative, etc? You have our attention!

    Don’t tell me the grades should be gathered accurately. Obviously. Tell me something bigger.

  • > this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...).

    Mind-boggling idea to do this because OCR and pulling info out of PDFs has been done better and for longer by so many more mature methods than having an LLM do it

    • Nit: I'd say, as someone who spent a fair amount of time doing it in the life insurance space, that parsing arbitrary PDFs is very much not a solved problem without LLMs. Parsing a particular PDF is, at least until they change their table format or w/e.

      I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.

      9 replies →

    • I would assume they OCR first, then extract whatever info they need from the result using LLMs

      Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."

      2 replies →

    • I would love to hear more about the solutions you have in mind, if you're willing.

      The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.

    • The parent says

      > that information is buried in PDFs sent by schools (often not standardized).

      I don't think OCR will help you there.

      An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.

      2 replies →

Nothing new to see here. If you are still surprised by model hallucinations in 2025, it might be time for you to catch up or jump on the next hype bandwagon. Also, they reacted well:

> Once confirmed, we corrected the extracted grade immediately.

> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.

I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

  • What's new or pertinent here is the specific real world use case and who it's impacting.

    >It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

    Again I would say that's why context is significant. You are strictly right, but it was applied in this instance for the purpose of faithfully representing grades. So I wouldn't say it's necessarily a matter of misunderstanding the design - the errors are real, after all - but the fact that it was entrusted with faithful factual representation is what makes it an important story.

  • It's true, but I think people have a misunderstanding that if you add search / RAG to ground the LLM, the LLM won't hallucinate. When in reality the LLM can still hallucinate, just convincingly in the language of whatever PDF it retrieved.

    • RAG certainly doesn't reduce hallucinations to 0, but using RAG correctly in this instance would have solved the hallucinations they describe.

      The source of the errors described in this post is OCR inaccuracy - it's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts, and just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.

      The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.
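
      A minimal sketch of that validation step (the field names are hypothetical; assumes you still have the PDF's raw text layer):

        # Reject any extracted grade whose source snippet can't be found
        # verbatim in the document text; flagged items go to a human.
        import re

        def normalize(s: str) -> str:
            # Collapse whitespace so OCR line breaks don't cause false rejections.
            return re.sub(r"\s+", " ", s).strip().lower()

        def validate_extraction(pdf_text: str, extractions: list[dict]) -> list[dict]:
            haystack = normalize(pdf_text)
            flagged = []
            for item in extractions:
                # Expect the LLM to return the exact snippet it read the grade from.
                snippet = normalize(item.get("source_snippet", ""))
                if not snippet or snippet not in haystack:
                    flagged.append(item)  # route to human review, don't auto-accept
            return flagged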

      3 replies →

  • Hallucinations are also completely normal, "by design" - they're just the output / experience of the system that produces them. It's just us who decided on the classification of what's real and what isn't, and looking at the state of things, we are not even very good at agreeing on where the line is.

    I know this sounds pedantic, but I think that the phenomenon itself is very human, so it's fascinating that we created something artificial that is a little bit like another human, and here it goes, producing similar issues. Next thing you know it will have emotions, and judgment.

  • Never thought about it from that perspective, but I think you're right. It is by design, not deceptive intent, just the infinite monkeys theorem where we've replaced randomness with pattern matching trained on massive datasets.

    • Another way to look at it is that everything an LLM creates is a 'hallucination'; some of these 'hallucinations' are more useful than others.

      I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.

      This isn't to say the outputs aren't useful, we see that they can be very useful...when used well.

    • The way I've been putting it for a while is, "all they do is hallucinate—it's the only thing they do. Sometimes the hallucinations are useful."

    • The key idea is the model doesn't have any signal on "factual information." It has a huge corpus of training data and the assumption that humans generally don't lie to each other when creating such a corpus.

      ... but (a) we do, and (b) there's all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).

  • > I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

    While I do see the issue with the word hallucination humanizing the models, I have yet to come up with or see a word that so well explains the problem to non-technical people. And quite frankly those are the people who need to understand that this problem still very much exists and is likely never going away.

    Technically yeah the model is doing exactly what it is supposed to do and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work, and just confusing them more.

    • > And quite frankly those are the people that need to understand that this problem still very much exists and is likely never going away.

      Calling it a hallucination leads people to think that they just need to stop it from hallucinating.

      In layman's terms, it'd be better to understand that LLMs are schizophrenic. Even though that's not really accurate either.

      A better way to get it across is that the models only understand reality through what they've read about it, and we then ask them for answers "in their own words" - but that's a lot longer than "hallucination".

      It's like the gag in The 40-Year-Old Virgin where he describes breasts feeling like bags of sand.

  • > Nothing new to see here.

    Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.

    This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was a 3o problem, our current models don't do that".

    • I completely agree with you. GP’s cynical take is an upvote magnet but doesn’t contribute to the discourse.

  • I don’t understand the issue with the word “hallucination”.

    If a model hallucinates, it did do something wrong - something that we would ideally like to minimize.

    The fact that it’s impossible to completely get rid of hallucinations is separate.

    An electric car uses electricity, it’s a fundamental part of its design. But we’d still like to minimize electricity usage.

  • I also hate the term "hallucination", but for a different reason. A hallucination is a confusion of internal stimulus as an external input. The models simply make errors, have bad memory, are overconfident, are sampling from a fantasy world, or straight up lie; often at rates that are not dissimilar from humans. For models to truly hallucinate, develop delusions and all that good schizophrenia stuff we would need to have a truly recurrent structure that has enough time to go through something similar to the prodrome, and build up distortions and ideas.

    TL;DR: being wrong, even very wrong != hallucination

  • > I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

    can you hear yourself? you are providing excuses for a computer system that produces erroneous output.

    • No, he does not.

      He is not saying it's OK for this system to provide wrong answers; he is saying it's normal for information from LLMs to be unreliable, and thus that the issue is not coming from the LLM but from the way it is being used.

    • We are in the late stage of the hype cycle for LLMs where the comments are becoming progressively ridiculous like for cryptocoins before the market crashed. The other day a user posted that LLMs are the new transistors or electricity.

School transcripts are, surprisingly, some of the hardest documents to parse. Two things make them tricky: (1) the multi-column tabular layouts and (2) the data ambiguity.

Transcript data is usually found in some sort of table, but they're some of the hardest tables for OCR or LLMs to interpret. There's all kinds of edge cases with tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all (every school in the country basically has their own format).

What we've seen help in these cases are:

1. VLM based review and correction of OCR errors for tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.

2. Using both HTML and Markdown as an LLM input format. For some of the edge cases, markdown cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data.

The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
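
Point 2 in practice, as a toy sketch (the input structure here is hypothetical; real OCR output carries its own span/merge metadata):

    # Render an OCR'd table as HTML rather than Markdown, since HTML can
    # express merged/nested cells that Markdown cannot.
    from html import escape

    def table_to_html(rows: list[list[dict]]) -> str:
        out = ["<table>"]
        for row in rows:
            cells = []
            for cell in row:
                span = f' colspan="{cell["colspan"]}"' if cell.get("colspan", 1) > 1 else ""
                cells.append(f"<td{span}>{escape(cell['text'])}</td>")
            out.append("<tr>" + "".join(cells) + "</tr>")
        out.append("</table>")
        return "\n".join(out)

    # e.g. a term header spanning both the course and grade columns:
    rows = [
        [{"text": "Fall 2023", "colspan": 2}],
        [{"text": "Internal Medicine Clerkship"}, {"text": "Honors"}],
    ]
    print(table_to_html(rows))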

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai/).

  • Would it help a lot to run it through multiple different AI systems and verify that they agree on the result?

    • Yeah, that can occasionally work and it's something we've tested, but unfortunately it introduces a lot of noise and makes systematic evals difficult.
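
      A toy sketch of the comparison, and of why the noise bites: naive string equality fails on trivial formatting differences, so you need normalization, which is itself lossy:

        # Only auto-accept fields where independent extraction runs agree;
        # everything else goes to review. Normalization here is a stand-in
        # for whatever tolerant comparison your data needs.
        def normalize(v: str) -> str:
            return " ".join(v.split()).lower()

        def agree(run_a: dict, run_b: dict) -> tuple[dict, set]:
            agreed = {k: run_a[k] for k in run_a
                      if k in run_b and normalize(run_a[k]) == normalize(run_b[k])}
            disputed = (run_a.keys() | run_b.keys()) - agreed.keys()
            return agreed, disputed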

Frustrating that their official recommendation is to verify the grades manually.

If a tool is designed to extract the grades for easy access, do we really believe that the end users will then verify the grades manually to confirm the output? If they’re doing that, why use the tool at all?

Maybe the tool can extract what it believes is the grades section and show a screenshot for a human to interpret.

  • Because the contract has already been signed, they can't guarantee it works right, and they don't want to be open to lawsuits. "You, mister wrongly-denied applicant, cannot sue us; we specifically told them to check all grades manually!"

  • This is why this particular emperor has no clothes. They keep trying to jam AI into stuff to make it "easier", but the LLMs, by their very nature do the tasks in lossy or incorrect ways. Imagine if Microsoft had sold Excel with a "be sure to verify all the calculations" caveat.

  • > If they’re doing that, why use the tool at all?

    Because the people purchasing the tool aren't the ones who will actually use it. The former get a "Deployed AI tooling to X to increase productivity by Y%" on their resume. The latter get left to deal with the mess.

Lots of comments in here that seem to have missed that this is about using vision-LLMs for OCR.

This makes it a slightly different issue from "hallucination" as seen in text based models. The model (which I think we can assume is GPT-5-mini in this case) is being fed scanned images of PDFs and is incorrectly reading the data from them.

Is this still a hallucination? I've been unable to identify a robust definition of that term, so it's not clearly wrong to call a model misinterpreting a document a "hallucination" even though it feels to me like a different category of mistake to an LLM inventing the title of a non-existent paper or lawsuit.

  • These kinds of errors have always existed and will always exist; there is no perfect way to extract info from documents like this.

    • The models really are getting better though. Compare Gemini 1.5 and Gemini 2.5 on the same PDF document (I've done this a bunch) and you can see the difference.

      The open question is how much better they need to get before they can be deployed for situations like this that require a VERY high level of reliability.

      2 replies →

Am I crazy, or was text parsing mastered long before AI? Why is GPT being used in this scenario in the first place?

  • Because it’s easier than asking for a consistently formatted data from all the sources who just output random PDFs. Basically this is a coordination / people problem we’re papering over with a fancy engineering solution. Many such cases.

  • Because it's less effort to get an MVP set up. Instead of having to test on a bunch of different PDFs and figure out how to address the right location in the text, just write a paragraph asking the LLM to do it. Of course, there are certain drawbacks...

  • It seems like a default mode for AI should be to generate repeatable Regex for text extraction.

    • Unfortunately many PDFs don't even internally represent text in a contiguous way.

  • Tables in PDFs still confuse traditional OCR engines. VLMs do better in some cases (though not this one, apparently).

While I don't want to discount the work of any physician-founded org - knowing, from working with them, the pain they go through after seeing 18 patients in a day's work - this still just looks like bad software. With no testing, no internal bench.

Did you do some kind of zod schema, or compare the error rates of different models on this task? Did you bother setting up any kind of JSON output at all? Did you add a second validation step with a different model and then compare that their numbers are the same?

It looks like no, they just deferred the whole thing to authority. Technically there's no difference between them saying that gpt5-mini or llama2-7b did this.

Literally every single LLM will make errors and hallucinate. It's your job to put all the scaffolding around it to make sure it doesn't, or that it does a lot less than a skilled human would.

So then have you measured the error rate or maybe tried to put some kind of error catching mechanism just like any professional software would do?
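
In Python terms (pydantic standing in for zod; the field set and grade scale are made up), even a minimal schema gate catches a lot:

    # Any LLM output that doesn't parse into this shape gets rejected
    # instead of silently ingested.
    from typing import Literal

    from pydantic import BaseModel, ValidationError

    class ClerkshipGrade(BaseModel):
        course: str
        grade: Literal["Honors", "High Pass", "Pass", "Fail"]
        source_snippet: str  # verbatim text the grade was read from

    def parse_or_flag(raw_json: str) -> ClerkshipGrade | None:
        try:
            return ClerkshipGrade.model_validate_json(raw_json)
        except ValidationError:
            return None  # flag for human review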

I keep circling this with AI and I'm not really sure what to do with it. They mention that the AI is meant to be used as reference only in the linked article but what does that actually mean? Who is checking who? Is the AI filling out the data from what it sees in the PDF and the user is expected to check it or is the user filling out the data and the AI is expected to check it?

Is the cost of AI useful if all you're doing is something like 'linting' the extraction? How do you guarantee that people really, truly, are doing the same work as before and not just blindly clicking 'looks good'. What is the value of the AI telling you something when you cannot tell if it is lying?

  • Yeah, I've seen this "for reference only" wording in many places, often used as a sort of disclaimer on stuff that could be wrong, but I have absolutely no idea what it means in that context. To me "reference" implies comprehensive, high quality information that I can refer to when I need to know some obscure detail of something.

    Is there some legal context in which this phrase has a specific meaning, perhaps?

Using a mini model for this seems grossly irresponsible. I've been doing some work testing models for similar extraction tasks (nothing where a failure affects someone's grade or anything) and gpt mini / Gemini flash simply can't do this sort of thing. Using anything less than the highest model with reasoning, you're guaranteed to get this sort of thing happening.

It is very tempting to do it, obviously, given the cost difference, but it's not worth it. On the other hand, people talk about LLMs with a broad brush, and I don't know - there's still testing to do, but I would be surprised to hear that GPT-5-pro with thinking had an issue like this.

I regularly use LLM-as-OCR and find it really helpful to do the following (a rough sketch of steps 2-4 comes after the list):

1. Minimize the number of PDF pages per context/call. Don't dump a giant document set into one request. Break them into the smallest coherent chunks.

2. In a clean context, re-send the page and the extracted target content and ask the model to proofread/double-check the extracted data.

3. Repeat the extraction and/or the proofreading steps with a different model and compare the results.

4. Iterate until the proofreadings pass without altering the data, or flag proofreading failures for stronger models or human intervention.
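
A rough sketch of steps 2-4, assuming an OpenAI-compatible client (the model ids and prompt are illustrative, not a recommendation):

    # In a clean context, re-send the page plus the extracted data and ask
    # for a verdict; require every model to pass independently, otherwise
    # escalate to a stronger model or a human.
    import json

    from openai import OpenAI

    client = OpenAI()
    PROOFREAD_MODELS = ["gpt-5-mini", "gpt-5"]  # hypothetical ids

    def proofread(model: str, page_image_url: str, extraction: dict) -> bool:
        resp = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": [
                {"type": "text",
                 "text": 'Does this JSON exactly match the page? '
                         'Reply {"match": true or false}.\n' + json.dumps(extraction)},
                {"type": "image_url", "image_url": {"url": page_image_url}},
            ]}],
        )
        return json.loads(resp.choices[0].message.content).get("match") is True

    def verify(page_image_url: str, extraction: dict) -> bool:
        return all(proofread(m, page_image_url, extraction)
                   for m in PROOFREAD_MODELS)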

I see that _even with search/RAG_ LLMs hallucinate. They just hallucinate more convincingly, in the language of the documents you retrieved.

So you really have to double check when researching information that really matters.

This sucks. Residency match is stressful as it is, and adding systems like these just make the experience even worse for the applicants.

Source: spouse matched in 2018. It was one of the most stressful periods of our lives.

Seems like hallucination will always be an issue for predict-the-next-word training. Maybe we need to rethink pretraining.

Not only did the AI hallucinate the applicant grade, but also the model name!

GPT-5o-whatever ain’t a thing.

The irony is sweeeeet

It's predicting the next token by statistical approximation. Hallucination vs fact is an ad-hoc distinction we impose on the result to suit our purpose.

I see your point here, but please take a look at the "standard" unstructured PDF extraction algos - they have a lot of problems as well. LLM-based extraction is still (on average) a big improvement.

There is no such thing as GPT-5o-mini, or GPT-5o. Concerning that the methodology seems to repeat the same error, not just the submitted title.

  • https://www.thalamusgme.com/blogs/methodology-for-creation-a...

    they actually write it:

    > For this cycle, we have refined our model architecture, expanded the catalog of medical schools and grading schemas, and upgraded to include the GPT-5o-mini model for increased accuracy and efficiency. Real-time validation has also been strengthened to provide programs with more reliable percentile and grade distribution data. Together, these enhancements make transcript normalization an even more powerful tool to support fair, consistent, and data-driven review in the transition to residency.

  • they probably mean gpt-5-low. but the small models are bad for parsing data where the data has strong implications

Nothing new to see here. Humans also hallucinate, as you can tell from the model name.

>Reviewers are strongly encouraged to verify all information against the applicant’s official PDF transcript. This reminder is also displayed directly within the product.

This is not how this works. You know people will not do this. In fact the whole value proposition hinges on people not doing this. If the information needs to be verified by a human, then it takes more time than just going through the document.

If your product can not be trusted, then it can not be used to make important decisions. Pushing the responsibility to not use your product on the user is absurd and does not make your actions any less negligent.

Semi-related but Sonnet 4.5 drives me absolutely insane.

I tell it a date, like March 2024 as the start, and October 2025 as the current month.

It still thinks that is 7 months somehow... and this is Anthropic's latest model..
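
(For the record, March 2024 to October 2025 is 19 months, and it's exactly the kind of thing to compute outside the model:)

    # Month arithmetic is trivial and deterministic in code; don't ask the
    # model to do it. March 2024 -> October 2025:
    start = (2024, 3)
    end = (2025, 10)
    months = (end[0] - start[0]) * 12 + (end[1] - start[1])
    print(months)  # 19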

Wow. Never ceases to amaze me how some people in these comment sections remain blind to the power of Artificial Intelligence (AI). Have you not tried prompting the model correctly? My startup gets 0 hallucinations on the latest iteration of Claude Sonnet using a custom proprietary reflecting RAG framework inspired by ontology.

  • It never ceases to amaze me when startup founders claim that every problem is the same. Some use cases (like parsing text out of PDF) can’t be distilled down to a prompt.

LLMs can't hallucinate. The correct phrase would be "GPT-5o-mini generates medical residency applicant grades". Everywhere you see the word "hallucinate" applied to a program's output, it should be replaced with "generate" for clarity.

  • If you're being 100% literal, sure. But language evolves and it's the accepted term for the concept. OpenAI themselves uses the phrase - https://openai.com/index/why-language-models-hallucinate/

    • OpenAI are the last people I would take as a reference, because they are financially motivated to keep up the charade of a "thinking" LLM or so-called "AI". That's why they widely use anthropomorphic terms like "hallucination" or "reasoning" or "thinking", while their computer programs can do none of those things. LLM companies sometimes even expose their own hypocrisy. My favorite example so far is when Anthropic showed in their own paper that asking an LLM how it "reasoned" through calculating a sum of numbers doesn't match reality at all; it's all generated slop.

      This is why it is important that users (us) don't fall into the anthropomorphism trap and call programs what they are and describe what they really do. Especially important since the general populace seems to be deluded by OpenAI's and Anthropic's aggressive lies into believing that LLMs can think.