Comment by kbumsik

3 months ago

> realistic written documents?

Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.

I have seen so many mistakes when looking at the result closely.

Here is a DEF 14A filing from Salesforce. You can print it to a PDF and then try converting.

https://www.sec.gov/Archives/edgar/data/1108524/000110852425...

Historical filings are still a problem, but hasn’t the SEC required filing in an XML format since the end of 2024?

  • It's not really about SEC filings, though. While we folks on HN would never think of hard copies of invoices, much of the world still operates this way.

    As mentioned above, I have about 200 construction invoices. They are all formatted in a way that doesn't make sense. Most fail both OCR and OpenAI