Comment by aprilthird2021

11 hours ago

> this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...).

Mind-boggling idea to do this because OCR and pulling info out of PDFs has been done better and for longer by so many more mature methods than having an LLM do it

20 comments

aprilthird2021

mattnewton 11 hours ago

Nit, I’d say as someone who spend a fair amount of time doing it in the life insurance space, actually parsing arbitrary pdfs is very much not a solved problem without LLMs. Parsing a particular pdf is, at least until they change their table format or w/e.

I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.

simonw 10 hours ago
Right - the problem with PDF extraction is always the enormous variety of shapes that data might take in those PDFs.
If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vLLMs look perilously close to being a great solution.
Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!
- ajcp 8 hours ago
  
  -> put Gemini 2.5 at the top of the pack
  I have come to the same conclusion having built a workflow that has seen 10 million+ non-standardized PDFs (freight bill of ladings) with running evaluations, as well as against the initial "ground-truth" dataset of 1,000 PDFs.
  Humans: ~65% accurate
  Gemini 1.5: ~72% accurate
  Gemini 2.0: ~88% accurate
  Gemini 2.5: ~92%* accurate
  *Funny enough we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until as a lark we decided to just copy the same prompt 10x. Squeezed 2% more out of that one :D
  
  2 replies →
- slacktivism123 9 hours ago
  
  >Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks
  Got it. The non-experts are holding it wrong!
  The laymen are told "just use the app" or "just use the website". No need to worry about API keys or routers or wrapper scripts that way!
  Sure.
  Yet the laymen are expected to maintain a mental model of the failure modes and intended applications of Grok 4 vs Grok 4 Fast vs Gemini 2.5 Pro vs GPT-4.1 Mini vs GPT-5 vs Claude Sonnet 4.5...
  It's a moving target. The laymen read the marketing puffery around each new model release and think the newest model is even more capable.
  "This model sounds awesome. OpenAI does it again! Surely it can OCR my invoice PDFs this time!"
  I mean, look at it:
  GPT‑5 not only outperforms previous models on benchmarks and answers questions more quickly, but—most importantly—is more useful for real-world queries. GPT‑5 is our best model yet for health-related questions, empowering users to be informed about and advocate for their health. The model scores significantly higher than any previous model on HealthBench , an evaluation we published earlier this year based on realistic scenarios and physician-defined criteria. GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 without tools), real-world coding (74.9% on SWE-bench Verified, 88% on Aider Polyglot), multimodal understanding (84.2% on MMMU), and health (46.2% on HealthBench Hard) The model excels across a range of multimodal benchmarks, spanning visual, video-based, spatial, and scientific reasoning. Stronger multimodal performance means ChatGPT can reason more accurately over images and other non-text inputs—whether that’s interpreting a chart, summarizing a photo of a presentation, or answering questions about a diagram.
  And on and on it goes...
  
  2 replies →
walkabout 9 hours ago
I've been doing PDF data extraction with LLMs at my day job, and my experience is to get them sufficiently reliable for a document of even moderate complexity (say, has tables and such, form fields, that kind of thing) you end up writing prompts so tightly-coupled to the format of the document that there's nothing but down-side versus doing the same thing with traditional computer vision systems. Like, it works (ask me again in a couple years as the underlying LLMs have been switched out, whether it's turned into wack-a-mole and long-missed data corruption issues... I'd bet it will) but using an LLM isn't gaining us anything at all.
Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.
- mattnewton 4 hours ago
  
  Eh, I guess we’re looked at different PDFs and models. Gemini 2.5 flash is very good, and Gemini 2.0 and Claude 3.7 were passable at parsing out complicated tables in image chunks, and we did have a fairly small prompt that worked >90% of cases. Where we had failures they were almost always in asking the model to do something infeasible (like parse a table where the header was on a previous, not provided page).
  If you have a better way to parse PDFs using opencv or whatever, please provide this service and people will buy it for their RAG chat bots or to train vlms.
0x457 7 hours ago

Would it be helpful if LLM creates bounding boxes for "traditional" OCR to work on? I.e. allowing extraction of information of arbitrary PDF as if it was a "particular pdf"

williamdclt 10 hours ago

The parent says

> that information is buried in PDFs sent by schools (often not standardized).

I don't think OCR will help you there.

An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.

aprilthird2021 10 hours ago
Don't most jobs do OCR on the resumes sent in for employment? I get that a resume is a more standard format. Maybe that's the rub
- simonw 9 hours ago
  
  The challenge here is that it's not just OCR for extracting text from a resume, this is about extracting grades from school transcripts. That's a LOT harder, see this excellent comment: https://news.ycombinator.com/item?id=45581480

fxwin 9 hours ago

I would assume they OCR first, then extract whatever info they need from the result using LLMs

Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."

simonw 8 hours ago
It's a bit difficult to derive exactly what they're using here. There's quite a lot of detail in https://www.thalamusgme.com/blogs/methodology-for-creation-a... but still mentions "OCR models" separately from LLMs, including a diagram that shows OCR models as a separate layer before the LLM layer.
But... that document also says:
"For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"
Which makes it sounds like they were using vision-LLMs for that OCR step.
Using a separate OCR step before the LLMs is a lot harder when you are dealing with weird table layouts in the documents, which traditional OCR has usually had trouble with. Current vision LLMs are notably good at that kind of data extraction.
- fxwin 7 hours ago
  
  Thanks, I didn't see that part!

daemonologist 11 hours ago

I would love to hear more about the solutions you have in mind, if you're willing.

The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.

bbarnett 11 hours ago

Welcome to the world of greybeards, baffled by everyone using AWS at 100s to 100000s of times the cost of your own servers.

lazystar 10 hours ago

spectre/meltdown, finding out your 6 month order of ssd's was stolen after opening empty boxes in the datacenter, and having to write RCA's for customers after your racks go over the PSU's limit are things ya'll greybeards seem to gloss over in your calculations, heh