Comment by faxmeyourcode

17 days ago

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.

(from the Gemini team) We're working on it! Semantic chunking and extraction will definitely be possible in the coming months.

>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1], a relatively recent library that ingests SEC submissions and filings. It doesn't use OCR; from what I can tell, it directly parses the XBRL documents that companies submit to EDGAR, when they exist.

[1] https://github.com/dgunning/edgartools
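
To illustrate what "parse the tagged data instead of OCR" can look like, here is a minimal stdlib-only sketch. It is not how edgartools works internally (I haven't checked), and the `InlineXbrlFacts` class, `extract_facts` helper, and `SAMPLE` snippet are made up for the example. It assumes the filing is inline XBRL, where numeric values are wrapped in `ix:nonFraction` tags, and it ignores the `sign` and `format` attributes that a real parser would need to handle.

```python
from html.parser import HTMLParser


class InlineXbrlFacts(HTMLParser):
    """Collect numeric facts wrapped in <ix:nonFraction> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._current = None  # attributes of the fact being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and attribute names.
        if tag == "ix:nonfraction":
            self._current = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "ix:nonfraction" and self._current is not None:
            raw = "".join(self._text).strip().replace(",", "")
            # "scale" is a power of ten applied to the displayed number.
            scale = int(self._current.get("scale", "0"))
            self.facts.append({
                "concept": self._current.get("name"),
                "context": self._current.get("contextref"),
                "unit": self._current.get("unitref"),
                "value": float(raw) * 10 ** scale,
            })
            self._current = None


# Made-up fragment: one tagged cell and one untagged cell.
SAMPLE = """
<table>
<tr><td>Revenue</td>
<td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
    unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td></tr>
<tr><td>Some non-standard metric</td><td>42</td></tr>
</table>
"""


def extract_facts(html):
    parser = InlineXbrlFacts()
    parser.feed(html)
    parser.close()
    return parser.facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE):
        print(fact)
```

Only the tagged revenue cell comes back as a fact; the untagged "42" is skipped, so anything a company leaves untagged still needs table parsing of some kind.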

  • I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.

If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.

Isn't everyone on iXBRL now? Or are you struggling with historical filings?

  • XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.