
Comment by walkabout

1 day ago

I've been doing PDF data extraction with LLMs at my day job, and my experience is that to get them sufficiently reliable for a document of even moderate complexity (say, one with tables, form fields, that kind of thing), you end up writing prompts so tightly coupled to the document's format that there's nothing but downside versus doing the same thing with traditional computer vision systems. Like, it works (ask me again in a couple of years, once the underlying LLMs have been switched out, whether it's turned into whack-a-mole and long-missed data-corruption issues... I'd bet it will), but using an LLM isn't gaining us anything at all.

Like, this company could have done the same projects we've been doing, and probably gotten them done faster (and certainly with better performance and lower operational costs), any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.

Eh, I guess we've looked at different PDFs and models. Gemini 2.5 Flash is very good, and Gemini 2.0 and Claude 3.7 were passable at parsing complicated tables out of image chunks, and we had a fairly small prompt that worked in >90% of cases. Where we had failures, they were almost always from asking the model to do something infeasible (like parse a table whose header was on a previous page that wasn't provided).
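For what it's worth, a minimal sketch of the approach described above: ask the model (per image chunk) for a JSON table and reject anything that doesn't match the expected shape, so infeasible cases (like a header on an unprovided page) surface as errors instead of silent corruption. The prompt wording, schema, and `validate_table` helper here are my own illustrative assumptions, not the actual pipeline; the model call itself is omitted.

```python
import json

# Illustrative per-chunk prompt; the real prompt will differ.
EXTRACTION_PROMPT = (
    "Extract every table in this image as JSON: "
    '{"header": [...], "rows": [[...], ...]}. '
    "If a table's header is not visible in this image, "
    'return {"error": "..."} instead of guessing.'
)

def validate_table(raw: str) -> dict:
    """Parse a model response and check every row against the header width.

    Raises ValueError on malformed or declined output, so failures are
    caught at extraction time rather than corrupting downstream data.
    """
    table = json.loads(raw)
    if "error" in table:
        raise ValueError(f"model declined: {table['error']}")
    header, rows = table["header"], table["rows"]
    if not header:
        raise ValueError("empty header")
    bad = [i for i, row in enumerate(rows) if len(row) != len(header)]
    if bad:
        raise ValueError(f"rows {bad} do not match header width {len(header)}")
    return table
```

The point of the width check is exactly the failure mode mentioned above: when the model can't actually see the full table, the output tends to be structurally inconsistent, and it's much cheaper to reject the chunk than to debug corrupted data later.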

If you have a better way to parse PDFs using OpenCV or whatever, please offer it as a service; people will buy it for their RAG chatbots or to train VLMs.