Comment by fxwin
11 hours ago
I would assume they OCR first, then extract whatever info they need from the result using LLMs
Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
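For illustration, the two-stage pipeline I mean might look roughly like this (a sketch only; pytesseract, the OpenAI SDK, and the file name are my assumptions, the quote doesn't say which OCR engine or model they actually use):

    import pytesseract
    from PIL import Image
    from openai import OpenAI

    client = OpenAI()

    # Stage 1: a traditional OCR engine turns the scanned transcript into raw text.
    raw_text = pytesseract.image_to_string(Image.open("transcript_page1.png"))  # hypothetical file

    # Stage 2: an LLM parses the structured fields out of the noisy OCR output.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Extract clerkship names and grades from this transcript text as JSON."},
            {"role": "user", "content": raw_text},
        ],
    )
    print(resp.choices[0].message.content)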
It's a bit difficult to derive exactly what they're using here. There's quite a lot of detail in https://www.thalamusgme.com/blogs/methodology-for-creation-a... but it still mentions "OCR models" separately from LLMs, including a diagram that shows OCR models as a separate layer before the LLM layer.
But... that document also says:
"For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"
Which makes it sound like they were using vision LLMs for that OCR step.
Using a separate OCR step before the LLMs gets a lot harder when the documents have weird table layouts, which traditional OCR has usually struggled with. Current vision LLMs are notably good at that kind of data extraction.
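For a concrete sense of the difference, a single vision-capable model can both read the image and emit structured rows in one call (a sketch only; the model name, prompt, JSON shape, and file name here are mine, not from their methodology doc):

    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("transcript_page1.png", "rb") as f:  # hypothetical file
        b64 = base64.b64encode(f.read()).decode()

    # One call: the vision LLM handles both the "OCR" and the table extraction,
    # so an odd table layout never has to survive a separate OCR pass.
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; illustrative
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'Read the grades table in this transcript image and return JSON: {"clerkships": [{"name": "...", "grade": "..."}]}'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)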
Thanks, I didn't see that part!