Comment by rudolph9

5 months ago

We parse millions of PDFs with Apache Tika, at roughly 30,000 PDFs per dollar of compute. However, the structured output leaves something to be desired, and Tika fails to parse a significant number of pages.

https://tika.apache.org/
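
For a sense of scale, the throughput figure above can be turned into a budget estimate. This is only back-of-the-envelope arithmetic: the 30,000-per-dollar rate is the one quoted in the comment, while the function name and the one-million-document workload are illustrative assumptions.

```python
# Rate quoted in the comment: about 30,000 PDFs per dollar of compute.
PDFS_PER_DOLLAR = 30_000


def compute_cost_usd(num_pdfs: int, pdfs_per_dollar: int = PDFS_PER_DOLLAR) -> float:
    """Estimate compute cost in dollars for parsing num_pdfs documents."""
    return num_pdfs / pdfs_per_dollar


if __name__ == "__main__":
    # Hypothetical workload of one million PDFs.
    print(f"${compute_cost_usd(1_000_000):.2f}")
```

At that rate, a million PDFs comes to a little over $33 of compute, before storage, retries, or any handling of the pages Tika cannot parse.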

1 comment

rudolph9 5 months ago

Under the hood, Tika uses Tesseract for OCR. To be clear, this all works surprisingly well in general, it's pretty easy to run yourself, and it's an order of magnitude cheaper than most services out there.

https://tesseract-ocr.github.io/tessdoc/
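
As a rough idea of what running it yourself can look like, here is a minimal sketch that sends a PDF to a Tika server over its REST endpoint. It assumes a tika-server instance is already listening on localhost:9998 (its default port) and that Tesseract is installed on that machine so OCR is available. The helper names are made up for illustration, and the `X-Tika-PDFOcrStrategy` header is optional; check the tika-server docs for the values your version accepts. This is not necessarily how the setup described above is wired.

```python
import urllib.request

# Assumption: a local tika-server on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL, ocr_strategy=None):
    """Build a PUT request asking Tika for the plain-text content of a PDF."""
    headers = {
        "Accept": "text/plain",             # ask for extracted text only
        "Content-Type": "application/pdf",  # tell Tika what is being sent
    }
    if ocr_strategy is not None:
        # Optional per-request PDF OCR setting, e.g. "ocr_only" or "no_ocr".
        headers["X-Tika-PDFOcrStrategy"] = ocr_strategy
    return urllib.request.Request(
        tika_url, data=pdf_bytes, headers=headers, method="PUT"
    )


def extract_text(pdf_bytes, **kwargs):
    """Send the PDF to the Tika server and return the extracted text."""
    request = build_tika_request(pdf_bytes, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "scan.pdf" is a placeholder path.
    with open("scan.pdf", "rb") as f:
        print(extract_text(f.read(), ocr_strategy="ocr_only")[:500])
```

Building the request separately from sending it keeps the header logic easy to check without a running server. For pages that come back empty, retrying with a different OCR strategy is one thing to try.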