Comment by bilekas

11 hours ago

Am I crazy, or was text parsing mastered long before AI? Why is GPT being used in this scenario in the first place?

Because it’s easier than asking for consistently formatted data from all the sources that just output random PDFs. Basically this is a coordination / people problem we’re papering over with a fancy engineering solution. Many such cases.

Because it's less effort to get an MVP set up. Instead of having to test on a bunch of different PDFs and figure out how to address the right location in the text, just write a paragraph asking the LLM to do it. Of course, there are certain drawbacks...

It seems like the default mode for AI should be to generate a repeatable regex for text extraction.
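To make that concrete, here is a rough sketch of what a "generate the regex once, then reuse it" setup could look like. The field names and patterns are made up for illustration (nothing in this thread specifies the actual documents), and real PDFs would need patterns tuned to their own layout.

```python
import re

# Hypothetical patterns for two invented fields; these are assumptions,
# not taken from any real document format.
PATTERNS = {
    "invoice_id": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w[\w-]*)", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

def extract_fields(text):
    """Return the first match for each field, or None if it is missing."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

sample = "Invoice No: INV-2041\nTotal: $1,234.50"
fields = extract_fields(sample)
```

The point is that the same input always gives the same output, and a missing field shows up as an explicit None rather than a plausible-looking guess.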

  • Unfortunately, many PDFs don't even store their text contiguously internally.
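For what it's worth, the usual workaround is to rebuild reading order from the position of each text fragment before running any pattern over it. A minimal sketch, assuming some extractor has already produced (x, y, text) tuples with y increasing down the page; that input shape, the function name, and the tolerance value are all assumptions for illustration:

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        # Fragments whose y is close to the current line's y join that line.
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments by x and join them with spaces.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]

# Fragments arrive out of order, as they might in a PDF content stream.
fragments = [(120.0, 10.5, "$42.00"), (10.0, 10.0, "Total:"), (10.0, 30.0, "Thanks")]
rebuilt = rebuild_lines(fragments)
```

This only handles simple single-column pages; multi-column layouts and tables need more than a y-tolerance.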

Tables in PDFs still confuse traditional OCR engines. VLMs do better in some cases (though not this one, apparently).