Comment by anonu

1 year ago

Ingesting PDFs accurately is a noble goal which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.

Many financial regulators require you to publish heavily marked-up statements in iXBRL. That markup captures nuances in the numbers (the reporting concept, period, unit, scale, and sign) that OCRing a post-processed table cannot recover.
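To make that concrete, here is a rough sketch of what the markup gives you that a rendered table doesn't. The filing fragment, the `FY2023` context id, and the `extract_facts` helper are all made up for illustration; only the `ix:nonFraction` element and its `name`, `contextRef`, `unitRef`, `scale`, and `sign` attributes come from the Inline XBRL spec. It also assumes plain comma-grouped numbers and ignores the `format` attribute that real filings use.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Inline XBRL 1.1 namespace.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Hypothetical fragment of a filing, invented for this example.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <table>
      <tr><td>Revenue</td>
          <td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
               unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td></tr>
      <tr><td>Net loss</td>
          <td>(<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023"
               unitRef="USD" scale="6" decimals="-6" sign="-">56</ix:nonFraction>)</td></tr>
    </table>
  </body>
</html>"""


def extract_facts(xhtml: str) -> list[dict]:
    """Return the tagged numeric facts with scale and sign applied.

    Assumption: displayed values use comma thousands separators and no
    other transformation (real filings declare this via `format`).
    """
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter(f"{{{IX_NS}}}nonFraction"):
        shown = (el.text or "").strip().replace(",", "")
        # `scale` is a power of ten: "1,234" with scale=6 means 1,234,000,000.
        value = Decimal(shown) * (Decimal(10) ** int(el.get("scale", "0")))
        # Parentheses in the rendered table are only presentation;
        # the `sign` attribute is what actually marks the value negative.
        if el.get("sign") == "-":
            value = -value
        facts.append({
            "name": el.get("name"),
            "context": el.get("contextRef"),
            "unit": el.get("unitRef"),
            "value": value,
        })
    return facts


for fact in extract_facts(SAMPLE):
    print(fact["name"], fact["context"], fact["unit"], fact["value"])
```

An OCR pass over the same table sees "1,234" and "(56)" and has to guess that the figures are in millions of USD, that the parentheses mean negative, and which accounting concept each row maps to. With the tags, none of that is a guess.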

Of course, financial documents are a narrow subset of the problem.

Maybe the problem is with PDF as a format: unfortunately, PDFs lose that metadata when they are built from source documents.

I can't help but feel that PDFs could be more portable, as their acronym suggests.

tomrod 1 year ago

Just a callout: even better, this library (even while still in active development) is blowing every other SEC tool I've found out of the water:

https://github.com/dgunning/edgartools