Comment by rudolph9

5 months ago

We parse millions of PDFs using Apache Tika, at about 30,000 PDFs per dollar of compute cost. However, the structured output leaves something to be desired, and Tika fails to parse a significant number of pages.

https://tika.apache.org/
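
To put the quoted throughput in per-document terms, here is a rough back-of-the-envelope sketch. The only number taken from the comment is 30,000 PDFs per dollar; the function name and the example batch sizes are hypothetical, and the figure covers compute cost only.

```python
# Rough cost arithmetic based on the ~30,000 PDFs per dollar quoted above.
# The batch sizes below are illustrative assumptions, not figures from the comment.
PDFS_PER_DOLLAR = 30_000


def compute_cost_usd(num_pdfs: int, pdfs_per_dollar: int = PDFS_PER_DOLLAR) -> float:
    """Return the approximate compute cost in dollars for parsing num_pdfs."""
    return num_pdfs / pdfs_per_dollar


if __name__ == "__main__":
    for n in (1, 1_000_000, 10_000_000):
        print(f"{n:>12,} PDFs -> ${compute_cost_usd(n):,.6f}")
```

At that rate a single PDF costs a few thousandths of a cent, and a million PDFs cost a little over 33 dollars.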