Comment by bilekas

4 months ago

Am I crazy, or was text parsing mastered long before AI? Why is GPT being used in this scenario in the first place?

8 comments

mattnewton 4 months ago

Because it’s easier than asking for consistently formatted data from all the sources, who just output random PDFs. Basically this is a coordination / people problem we’re papering over with a fancy engineering solution. Many such cases.

fragmede 4 months ago

Tables in PDFs still confuse traditional OCR engines. VLMs do better in some cases (though not this one, apparently).

tdeck 4 months ago

Because it's less effort to get an MVP set up. Instead of having to test on a bunch of different PDFs and figure out how to address the right location in the text, just write a paragraph asking the LLM to do it. Of course, there are certain drawbacks...

hansonkd 4 months ago

It seems like a default mode for AI should be to generate a repeatable regex for text extraction.
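
Roughly what that could look like: the model proposes a pattern once, a human reviews it, and from then on extraction is deterministic. This is only a sketch, assuming the text has already been pulled out of the PDF and that the documents contain lines like "Invoice No: 12345"; the patterns, field names, and `extract_fields` are all made up for illustration.

```python
import re

# Hypothetical patterns, e.g. proposed once by an LLM and then reviewed by hand.
INVOICE_RE = re.compile(r"Invoice\s+No[.:]?\s*(?P<number>\d+)", re.IGNORECASE)
TOTAL_RE = re.compile(r"Total[:\s]+\$?(?P<total>\d+(?:\.\d{2})?)", re.IGNORECASE)


def extract_fields(text: str) -> dict:
    """Return whichever fields the fixed patterns match.

    The same input always yields the same output, unlike a fresh LLM call.
    """
    fields = {}
    m = INVOICE_RE.search(text)
    if m:
        fields["number"] = m.group("number")
    m = TOTAL_RE.search(text)
    if m:
        fields["total"] = m.group("total")
    return fields
```

The catch is that this only helps once the text comes out of the PDF in a sane order.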

tdeck 4 months ago

Unfortunately many PDFs don't even internally represent text in a contiguous way.
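
To make that concrete: a page is typically stored as runs of glyphs placed at coordinates, in whatever order the producing tool emitted them. Below is a minimal sketch of rebuilding lines from such fragments, assuming a hypothetical list of `(x, y, text)` tuples that some PDF library has already extracted. The `rebuild_lines` name and the tolerance value are arbitrary, and real documents (columns, tables, rotated text) need far more than this.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); hypothetical shape


def rebuild_lines(fragments: List[Fragment], y_tol: float = 2.0) -> List[str]:
    """Group fragments with nearby y coordinates into lines, ordered by x."""
    lines: List[Tuple[float, List[Fragment]]] = []
    # The PDF y axis grows upward, so descending y reads top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= y_tol:
            lines[-1][1].append(frag)  # same visual line as the previous fragment
        else:
            lines.append((frag[1], [frag]))  # start a new line at this y
    # Within each line, sort left to right and join with single spaces
    # (a simplification: real spacing depends on glyph widths).
    return [" ".join(text for _, _, text in sorted(group)) for _, group in lines]
```

Even this toy version has to guess at what counts as "the same line", which is where the fragility comes from.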

hluska 4 months ago

Not in PDF.

bilekas 4 months ago

No, not in the PDF spec, but aren't we allowed to process images of every page, do text adaptation, etc.? Where does GPT come in?