Comment by faxmeyourcode

5 months ago

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.

6 comments

anirudhb99 5 months ago

(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.

otoburb 5 months ago

> I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1], a relatively recently released library that ingests SEC submissions and filings. It doesn't use OCR but (from what I can tell) directly parses the XBRL documents that companies submit and that are stored in EDGAR, if they exist.

[1] https://github.com/dgunning/edgartools
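
To give a feel for what "directly parse the XBRL" means, here is a rough sketch using only the Python standard library. It is not edgartools' API or implementation. The sample document, the `extract_facts` helper, and the values in it are made up for illustration, and real instance documents are far larger and messier.

```python
import xml.etree.ElementTree as ET

# Made-up miniature XBRL instance document (illustrative values only).
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
      xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <context id="FY2023">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000000000</identifier></entity>
    <period><startDate>2023-01-01</startDate><endDate>2023-12-31</endDate></period>
  </context>
  <unit id="USD"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2023" unitRef="USD" decimals="-3">1250000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="USD" decimals="-3">98000</us-gaap:NetIncomeLoss>
</xbrl>"""

XBRLI = "{http://www.xbrl.org/2003/instance}"


def extract_facts(xml_text):
    """Return one dict per fact (hypothetical helper, not a library call)."""
    root = ET.fromstring(xml_text)

    # Map each context id to its (start, end) dates; instants repeat the date.
    periods = {}
    for ctx in root.iter(XBRLI + "context"):
        period = ctx.find(XBRLI + "period")
        instant = period.findtext(XBRLI + "instant")
        start = period.findtext(XBRLI + "startDate") or instant
        end = period.findtext(XBRLI + "endDate") or instant
        periods[ctx.get("id")] = (start, end)

    facts = []
    for el in root:
        ctx_id = el.get("contextRef")
        if ctx_id is None:
            continue  # contexts and units carry no contextRef, so skip them
        # ElementTree spells namespaced tags as "{namespace}localname".
        namespace, _, local = el.tag[1:].partition("}")
        facts.append({
            "concept": local,
            "namespace": namespace,
            "value": (el.text or "").strip(),
            "unit": el.get("unitRef"),
            "period": periods.get(ctx_id),
        })
    return facts


for fact in extract_facts(SAMPLE):
    print(fact["concept"], fact["value"], fact["unit"], fact["period"])
```

The point is that every tagged number already carries its concept name, unit, and reporting period, so there is no table layout to reverse-engineer for those values.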

faxmeyourcode 5 months ago

I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.

barrenko 5 months ago

If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.

jgalt212 5 months ago

Isn't everyone on iXBRL now? Or are you struggling with historical filings?

faxmeyourcode 5 months ago

XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.
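
To make the "not properly tagged" problem concrete, here is a minimal sketch that splits the numeric cells of an inline XBRL table into tagged and untagged ones. It assumes the document is well-formed XHTML and that the first cell of each row is its label. The sample markup, the `split_cells` helper, and the numeric regex are all hypothetical, and real filings nest their tables far less politely.

```python
import re
import xml.etree.ElementTree as ET

IX = "{http://www.xbrl.org/2013/inlineXBRL}"
XHTML = "{http://www.w3.org/1999/xhtml}"

# Made-up iXBRL fragment: one tagged value, one untagged value.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"
      xmlns:us-gaap="http://fasb.org/us-gaap/2023">
<body><table>
<tr><td>Revenues</td>
    <td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
        unitRef="USD" scale="3">1,250</ix:nonFraction></td></tr>
<tr><td>Stores opened</td><td>42</td></tr>
</table></body></html>"""

# Crude test for "looks like a number": digits, commas, optional decimals,
# optional parentheses for negatives, optional percent sign.
NUMERIC = re.compile(r"^\(?-?[\d,]+(\.\d+)?\)?%?$")


def split_cells(xhtml_text):
    """Return (tagged, untagged) numeric cells, keyed by the row label."""
    root = ET.fromstring(xhtml_text)
    tagged, untagged = [], []
    for row in root.iter(XHTML + "tr"):
        cells = row.findall(XHTML + "td")
        if not cells:
            continue
        label = "".join(cells[0].itertext()).strip()
        for cell in cells[1:]:
            text = "".join(cell.itertext()).strip()
            fact = cell.find(".//" + IX + "nonFraction")
            if fact is not None:
                tagged.append((label, fact.get("name"), text))
            elif NUMERIC.match(text):
                untagged.append((label, text))
    return tagged, untagged


tagged, untagged = split_cells(SAMPLE)
print("tagged:", tagged)
print("untagged:", untagged)
```

Anything that lands in `untagged` is a value that XBRL parsing alone will never surface, which is where table chunking or extraction from the rendered text still has to take over.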