Comment by anonu
16 days ago
Ingesting PDFs accurately is a noble goal which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.
Many financial regulators require you to publish statements heavily marked up with iXBRL. These tags capture nuances in the numbers (the reporting concept, period, unit, scale and sign) that OCRing a post-processed table cannot recover.
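To make that concrete, here is a rough sketch of pulling tagged numbers out of an iXBRL fragment using only the Python standard library. The sample markup, the `IxFactParser` class and the `extract_facts` helper are all made up for illustration; real filings also use nested tags, `format` attributes and context definitions that this ignores, and it assumes comma thousands separators.

```python
from html.parser import HTMLParser

# Hypothetical two-cell fragment; real filings are far larger.
SAMPLE = (
    '<td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023" '
    'unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td>'
    '<td>(<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023" '
    'unitRef="USD" scale="6" decimals="-6" sign="-">56</ix:nonFraction>)</td>'
)


class IxFactParser(HTMLParser):
    """Collects ix:nonFraction facts (illustrative, not a full iXBRL parser)."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._current = None  # attributes of the tag being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "ix:nonfraction":
            self._current = dict(attrs)
            self._current["text"] = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "ix:nonfraction" and self._current is not None:
            a = self._current
            # Assumption: displayed value uses commas as thousands separators.
            shown = float(a["text"].strip().replace(",", ""))
            value = shown * 10 ** int(a.get("scale", "0"))
            if a.get("sign") == "-":
                value = -value
            self.facts.append({
                "name": a.get("name"),
                "context": a.get("contextref"),
                "unit": a.get("unitref"),
                "value": value,
            })
            self._current = None


def extract_facts(html):
    parser = IxFactParser()
    parser.feed(html)
    return parser.facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE):
        print(fact["name"], fact["unit"], fact["value"])
```

The point is that the page only shows "1,234" and "(56)", while the tags say these are revenues and net income in USD, scaled by a million, with the second one negative. None of that survives a trip through PDF and OCR.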
Of course, financial documents are a narrow subset of the problem.
Maybe the problem is with PDF as a format: unfortunately, PDFs lose that metadata when they are built from source documents.
I can't help but feel that PDFs could be more portable, as their acronym suggests.
One more callout: this library, even though it is still in active development, is blowing every other SEC tool I've found out of the water:
https://github.com/dgunning/edgartools