Comment by medicalthrow

18 hours ago

Hi HN, submitting from a burner since I'm an applicant in the current medical residency admissions cycle. I thought it would be interesting to show the real-world implications of using LLMs to extract information from PDFs. For context, Thalamus is a company that handles the "backend" for residency programs and all the applications they receive (including handling whom to invite for interviews, etc.). One of the more important factors in deciding applicant competitiveness is medical school performance (grades), but that information is buried in PDFs sent by schools (often not standardized). So this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...). Some programs have noticed discrepancies between extracted and reported grades (often in the direction of hallucinating "fails") and brought it to Thalamus's attention. Unfortunately, it doesn't look like the company is discontinuing use of the tool.

Regardless, given that there have been a number of posts looking into the use of LLMs for numerical extraction, I thought this story would be a useful cautionary tale. (A rough sketch of the kind of cross-check that would have caught this follows below.)

EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist
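
For what it's worth, here's a minimal sketch (Python, with made-up field names and grade values) of the grounding check I'd want before any extracted grade touches an application: if a grade can't be traced back verbatim to the transcript text, it goes to a human reviewer instead of into the system.

    # Hypothetical sketch: flag extracted grades that cannot be traced back to
    # the transcript text, instead of trusting the model's first pass.
    import re

    def grounded_grades(extracted: dict[str, str], transcript_text: str) -> dict[str, dict]:
        """Mark each course/grade pair as verified only if both the course name
        and the grade token appear in the source text; everything else should
        be routed to a human reviewer."""
        normalized = re.sub(r"\s+", " ", transcript_text).lower()
        results = {}
        for course, grade in extracted.items():
            found = course.lower() in normalized and re.search(
                rf"\b{re.escape(grade.lower())}\b", normalized
            )
            results[course] = {"grade": grade, "verified": bool(found)}
        return results

    if __name__ == "__main__":
        ocr_text = "Internal Medicine Clerkship ... Honors\nSurgery Clerkship ... Pass"
        llm_output = {"Internal Medicine Clerkship": "Honors", "Surgery Clerkship": "Fail"}
        print(grounded_grades(llm_output, ocr_text))
        # The hallucinated "Fail" is not verified and would be flagged for review.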

Hi, wondering if you could message me at shane.shifflett@dowjones.com or via Signal at 929 638 0009? https://www.wsj.com/news/author/shane-shifflett

  • You are so brave. I get like 8 spammers calling me daily about loans like I owe them money, and that's without blasting my phone number out to the internet.

    • If you're not using it yet, I recommend enabling the call screening feature on your phone; it has basically reduced my spam calls to zero. It's available on iPhones, Pixels, and Samsung phones (and probably others?).

      1 reply →

It's amazing how much of "inter-organization information flow" still happens over PDFs and/or just FTP'ing files around.

A couple of jobs ago, at a hedge fund, I owned the system that took financial data from counterparties, processed it, and sent it to internal teams for reconciliation, etc.

The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.

As others have pointed out: yes, the overall thrust of the industry is to get to something standardized, but 100% adoption will probably never happen.

I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657

  • > the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.

    I'm interested in what conditions kept you from rejecting those kinds of transactions, or blacklisting those counterparties for the future.

    We hear about companies firing/banning unprofitable customers sometimes; honestly, I'm surprised it doesn't happen more often.

Thank you for sharing this.

It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.

  • I imagine they would love to create a simple API for this, but the problem is convincing thousands of schools to use that API.

    If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid this kind of show-stopper problem.

    • They're essentially an ATS SaaS for medical schools; if they have enough schools, or enough prestigious schools, they can ask for whatever they want and the schools will oblige. A cheeky way to make it happen overnight: give a slight advantage to transcripts that are submitted digitally; the conversion would be complete within months.

      1 reply →

  • The trouble is getting people to use your API - in this case med schools, but it can be much, much worse (more and smaller organizations sending you data, and in some industries you have a legal obligation to accept it in any format they care to send).

Why don't they just email a form when you apply and have you fill in all the grades in a structured way? How many grades are we talking about here? Then the PDF would just be the proof that your grades were real.
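
Roughly what I have in mind, as a sketch; the field names and the allowed grade values here are guesses, not anything from an actual program:

    # Hypothetical sketch of a structured grade submission, with the PDF kept
    # only as supporting evidence rather than the machine-readable source of truth.
    from dataclasses import dataclass, asdict
    import json

    ALLOWED_GRADES = {"Honors", "High Pass", "Pass", "Fail", "Incomplete"}  # assumed scale

    @dataclass
    class ClerkshipGrade:
        school: str
        clerkship: str
        grade: str

        def __post_init__(self):
            if self.grade not in ALLOWED_GRADES:
                raise ValueError(f"Unknown grade {self.grade!r}; expected one of {sorted(ALLOWED_GRADES)}")

    submission = [
        ClerkshipGrade("Example Medical School", "Internal Medicine", "Honors"),
        ClerkshipGrade("Example Medical School", "Surgery", "Pass"),
    ]
    print(json.dumps([asdict(g) for g in submission], indent=2))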

  • Because you'd have to get thousands of schools to agree to using the same format.

    • Does the student not have access to the grades? As they are applying to medical school, a few hours of form-filling drudgery will still be the easiest part of the process.

      2 replies →

> this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...).

Mind-boggling to do this, given that OCR and pulling info out of PDFs have been done better, and for longer, by far more mature methods than having an LLM do it.

  • Nit: as someone who spent a fair amount of time doing this in the life insurance space, I'd say parsing arbitrary PDFs is very much not a solved problem without LLMs. Parsing a particular PDF is, at least until they change their table format or whatever.

    I don't think this idea is totally cursed; I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot-check, like a resume scraper, they are just taking the LLM's first pass as gospel.
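
    Something like this is all I mean; the extraction step is whatever you already have, and the point is just that the applicant confirms each grade against a quoted source snippet before anything is submitted:

        # Hypothetical sketch of a "first pass, then spot-check" flow: the model's
        # extraction is a draft for the applicant to confirm, not the final answer.

        def review_extraction(candidates: list[dict]) -> list[dict]:
            """Walk a human through each extracted grade and its supporting quote."""
            confirmed = []
            for c in candidates:
                print(f"{c['clerkship']}: {c['grade']}   (source: \"{c['quote']}\")")
                answer = input("Correct? [y/n/edit]: ").strip().lower()
                if answer == "y":
                    confirmed.append(c)
                elif answer == "edit":
                    c["grade"] = input("Enter the correct grade: ").strip()
                    confirmed.append(c)
                # "n" drops the candidate entirely
            return confirmed

        # llm_first_pass would come from whatever extraction step is in use.
        llm_first_pass = [
            {"clerkship": "Internal Medicine", "grade": "Honors", "quote": "IM Clerkship: Honors"},
            {"clerkship": "Surgery", "grade": "Fail", "quote": "Surgery Clerkship: Pass"},  # mismatch a human would catch
        ]
        reviewed = review_extraction(llm_first_pass)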

    • Right - the problem with PDF extraction is always the enormous variety of shapes that data might take in those PDFs.

      If all the PDFs are the same format, you can use plenty of existing techniques. If you have no control at all over that format, you're in for a much harder time, and vision LLMs look perilously close to being a great solution.

      Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!

      6 replies →

    • I've been doing PDF data extraction with LLMs at my day job, and my experience is that to get them sufficiently reliable for a document of even moderate complexity (say, one with tables, form fields, that kind of thing), you end up writing prompts so tightly coupled to the format of the document that there's nothing but downside versus doing the same thing with traditional computer vision systems. It works (ask me again in a couple of years, once the underlying LLMs have been switched out, whether it's turned into whack-a-mole and long-missed data corruption issues... I'd bet it will), but using an LLM isn't gaining us anything at all.

      Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.
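
      To make "tightly coupled" concrete, here's the flavor of prompt I'm talking about; the carrier name, labels, and layout are all invented, and the prompt only works as long as that exact layout holds:

          # Illustration only: a prompt whose every detail is coupled to one
          # (invented) issuer's document layout.
          PROMPT_TEMPLATE = """You are extracting fields from an Acme Life statement.
          Page 1 has a two-column header; the policy number is the value to the right
          of the label "Policy No." in the left column. The coverage table starts under
          the heading "Schedule of Benefits" and has columns: Benefit, Face Amount, Rider.
          If the table spills onto page 2, the continuation has no header row.
          Return JSON with keys: policy_number, benefits (a list of objects with keys
          benefit, face_amount, rider). The document text follows.

          """

          def build_prompt(document_text: str) -> str:
              # The moment Acme changes its table layout, this prompt silently breaks;
              # that's the maintenance problem described above.
              return PROMPT_TEMPLATE + document_text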

      1 reply →

    • Would it be helpful if an LLM created bounding boxes for "traditional" OCR to work on? I.e., allowing extraction of information from an arbitrary PDF as if it were a "particular PDF"?
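
      Something like this hybrid is what I'm imagining; propose_grade_regions() is a stand-in for whatever vision-model call you'd use, while the OCR on each crop stays deterministic:

          # Sketch of the hybrid: a vision model proposes regions, and a traditional
          # OCR engine (pytesseract here) reads only those crops.
          from PIL import Image
          import pytesseract

          def propose_grade_regions(page: Image.Image) -> list[tuple[int, int, int, int]]:
              """Placeholder: ask a vision LLM for (left, top, right, bottom) boxes
              around anything that looks like a grades table. Hard-coded here."""
              return [(100, 400, 1500, 900)]

          def ocr_regions(page_path: str) -> list[str]:
              page = Image.open(page_path)
              texts = []
              for box in propose_grade_regions(page):
                  crop = page.crop(box)  # cut the page down to the proposed region
                  texts.append(pytesseract.image_to_string(crop))  # deterministic OCR on the crop
              return texts

          # print(ocr_regions("transcript_page_1.png"))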

  • The parent says

    > that information is buried in PDFs sent by schools (often not standardized).

    I don't think OCR will help you there.

    An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF; don't expect it to always get it right.

  • I would love to hear more about the solutions you have in mind, if you're willing.

    The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.

  • I would assume they OCR first, then extract whatever info they need from the result using LLMs.

    Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
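
    If so, the shape of that pipeline is roughly this; the model name and output schema are my guesses, not anything from their write-up, and you'd still want validation and human review on the output:

        # Sketch of a two-stage pipeline: OCR first, then an LLM to structure the text.
        # Assumes the model returns bare JSON; real code would validate and fall back.
        import json
        from PIL import Image
        import pytesseract
        from openai import OpenAI

        client = OpenAI()

        def extract_grades(page_image_path: str) -> list[dict]:
            raw_text = pytesseract.image_to_string(Image.open(page_image_path))  # stage 1: OCR
            resp = client.chat.completions.create(                               # stage 2: structuring
                model="gpt-4o-mini",  # placeholder model choice
                messages=[
                    {"role": "system",
                     "content": "Extract clerkship grades from the transcript text. Reply with a "
                                "JSON array of objects with keys clerkship, grade, and quote. The "
                                "quote must be copied verbatim from the input; if a grade is not "
                                "present, omit the course rather than guessing."},
                    {"role": "user", "content": raw_text},
                ],
            )
            return json.loads(resp.choices[0].message.content)

        # grades = extract_grades("transcript_page_1.png")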

    • It's a bit difficult to derive exactly what they're using here. There's quite a lot of detail in https://www.thalamusgme.com/blogs/methodology-for-creation-a... but it still mentions "OCR models" separately from LLMs, including a diagram that shows OCR models as a separate layer before the LLM layer.

      But... that document also says:

      "For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"

      Which makes it sound like they were using vision LLMs for that OCR step.

      Using a separate OCR step before the LLMs is a lot harder when you are dealing with weird table layouts in the documents, which traditional OCR has usually had trouble with. Current vision LLMs are notably good at that kind of data extraction.
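
      For contrast with the OCR-first sketch above, the "LLM-powered OCR" path would look more like sending the page image straight to a vision-capable model; the model name here is a placeholder, and the same caveats about trusting the output apply:

          # Sketch of vision-LLM extraction: no separate OCR step, the model reads
          # the page image directly. Assumes the model returns bare JSON.
          import base64, json
          from pathlib import Path
          from openai import OpenAI

          client = OpenAI()

          def vision_extract(page_png_path: str) -> list[dict]:
              b64 = base64.b64encode(Path(page_png_path).read_bytes()).decode()
              resp = client.chat.completions.create(
                  model="gpt-4o",  # placeholder vision-capable model
                  messages=[{
                      "role": "user",
                      "content": [
                          {"type": "text",
                           "text": "List every clerkship grade on this transcript page as a JSON "
                                   "array of objects with keys clerkship and grade. If a grade is "
                                   "unreadable, use null instead of guessing."},
                          {"type": "image_url",
                           "image_url": {"url": f"data:image/png;base64,{b64}"}},
                      ],
                  }],
              )
              return json.loads(resp.choices[0].message.content)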

      1 reply →

  • Welcome to the world of greybeards, baffled by everyone using AWS at 100s to 100,000s of times the cost of running your own servers.

    • Spectre/Meltdown, finding out your 6-month order of SSDs was stolen after opening empty boxes in the datacenter, and having to write RCAs for customers after your racks go over the PSUs' limits are things y'all greybeards seem to gloss over in your calculations, heh.

How should medical residency work? Like, how should admissions work? Is the Match doing what you would want it to do? Is there a radical alternative? You have our attention!

Don’t tell me the grades should be gathered accurately. Obviously. Tell me something bigger.