Comment by otoburb
17 days ago
>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
For this specific use case you can also try edgartools[1] which is a library that was relatively recently released that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.
I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.