Comment by simonw
16 hours ago
Right - the problem with PDF extraction is always the enormous variety of shapes that data might take in those PDFs.
If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vision LLMs look perilously close to being a great solution.
Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!
> put Gemini 2.5 at the top of the pack
I have come to the same conclusion, having built a workflow that has seen 10 million+ non-standardized PDFs (freight bills of lading), with running evaluations as well as evaluation against an initial "ground-truth" dataset of 1,000 PDFs.
Humans: ~65% accurate
Gemini 1.5: ~72% accurate
Gemini 2.0: ~88% accurate
Gemini 2.5: ~92%* accurate
*Funnily enough, we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until, as a lark, we decided to just copy the same prompt 10x. That squeezed out 2% more :D
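The two ideas above can be sketched in a few lines: duplicating the extraction prompt N times before sending it, and scoring model output field-by-field against a ground-truth record to get a running accuracy number. This is a minimal illustration, not the poster's actual pipeline; the field names, the exact-match scoring, and the 10x factor are all assumptions, and the SDK call itself is omitted.

```python
# Sketch of the prompt-duplication trick and a field-level accuracy check.
# Field names and values are hypothetical; real bill-of-lading extraction
# would have many more fields and fuzzier matching rules.

BASE_PROMPT = "Extract shipper, consignee, and freight class from this bill of lading."

def build_duplicated_prompt(prompt: str, copies: int = 10) -> str:
    # Repeating the same instruction several times was (anecdotally) worth
    # a couple of accuracy points on Gemini 2.5 in the workflow above.
    return "\n\n".join([prompt] * copies)

def field_accuracy(predicted: dict, truth: dict) -> float:
    # Exact-match accuracy over the fields present in the ground-truth record.
    if not truth:
        return 0.0
    hits = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return hits / len(truth)

truth = {"shipper": "Acme Co", "consignee": "Globex", "freight_class": "70"}
pred = {"shipper": "Acme Co", "consignee": "Globex", "freight_class": "77"}
print(round(field_accuracy(pred, truth), 2))  # → 0.67
```

Averaging `field_accuracy` across the 1,000-document ground-truth set is what produces a single headline number like "~92% accurate" for a given model and prompt variant.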
Gemini 3.0 is rumored to drop any day now, will be very interesting to see the score that gets for your benchmark here.
As long as the ergonomics of the SDK stay the same. Jumping to a new model this far in is honestly something I don't want to contemplate wrestling with. When we were forced off 1.5 to 2.0, we found our context strategy had to be completely reworked to recover and then see better returns.
>Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks
Got it. The non-experts are holding it wrong!
The laymen are told "just use the app" or "just use the website". No need to worry about API keys or routers or wrapper scripts that way!
Sure.
Yet the laymen are expected to maintain a mental model of the failure modes and intended applications of Grok 4 vs Grok 4 Fast vs Gemini 2.5 Pro vs GPT-4.1 Mini vs GPT-5 vs Claude Sonnet 4.5...
It's a moving target. The laymen read the marketing puffery around each new model release and think the newest model is even more capable.
"This model sounds awesome. OpenAI does it again! Surely it can OCR my invoice PDFs this time!"
I mean, look at it. And on and on it goes...
"The non-experts are holding it wrong!"
We aren't talking about non-experts here. Go read https://www.thalamusgme.com/blogs/methodology-for-creation-a...
They're clearly competent developers (despite mis-identifying GPT-5-mini as GPT-5o-mini) - but they also don't appear to have evaluated the alternative models, presumably because of this bit:
"This solution was selected given Thalamus utilizes Microsoft Azure for cloud hosting and has an enterprise agreement with them, as well as with OpenAI, which improves overall data and model security"
I agree with your general point though. I've been a pretty consistent voice in saying that this stuff is extremely difficult to use.
> The laymen
The solution architects, leads, product managers, and engineers behind this feature are now laymen who shouldn't have done their due diligence on a system used for an extremely important task? Who shouldn't have tested it across a wide range of input PDFs for accuracy and accepted nothing below 100%?