Comment by nhirschfeld

5 months ago

So, for PDFs we need to distinguish between two types of text extraction:

1. Text extraction from a searchable PDF.

2. OCR.

For 1, Kreuzberg uses pypdfium2, a Python binding for PDFium, the Chromium PDF engine. In this regard Kreuzberg has top-notch performance: much faster than pdfminer.six, pdfplumber, etc.
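To make the searchable-PDF path concrete, here is a minimal sketch of pulling the text layer with pypdfium2 directly. It assumes the pypdfium2 4.x helper API (`PdfDocument`, `get_textpage`, `get_text_range`) and is not Kreuzberg's own code. `has_text_layer` and its `min_chars` threshold are a hypothetical heuristic for deciding whether a file has a usable text layer or needs OCR instead.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the embedded text of each page of a searchable PDF."""
    # Imported lazily so the helper below works without pypdfium2 installed.
    import pypdfium2 as pdfium  # third-party: pip install pypdfium2

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            try:
                # With default arguments this returns all text on the page.
                texts.append(textpage.get_text_range())
            finally:
                textpage.close()
                page.close()
        return texts
    finally:
        pdf.close()


def has_text_layer(page_texts: list[str], min_chars: int = 20) -> bool:
    """Hypothetical heuristic: too little embedded text suggests a scan."""
    return sum(len(text.strip()) for text in page_texts) >= min_chars
```

A caller could run `extract_page_texts("report.pdf")` (the file name is a placeholder) and fall back to OCR when `has_text_layer` returns `False`.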

Note that PyMuPDF also has top-notch performance, but it is AGPL-licensed, which makes it almost unusable in closed-source projects without paying for a commercial license.

For 2, Kreuzberg uses Tesseract, which is very solid. Performance is good, and Kreuzberg utilizes async worker processes to optimize concurrency.
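As a rough illustration of the async-worker idea, the sketch below runs the `tesseract` command-line tool in child processes and caps how many run at once with a semaphore. This is an assumption about the general pattern, not Kreuzberg's actual implementation. The names `ocr_image`, `ocr_many`, and `max_workers` are made up for the example, and it assumes a `tesseract` binary is on the PATH.

```python
import asyncio


async def ocr_image(path: str, lang: str = "eng") -> str:
    """Run the tesseract CLI on one image and return the recognized text."""
    # "stdout" as the output base tells tesseract to print the text.
    proc = await asyncio.create_subprocess_exec(
        "tesseract", path, "stdout", "-l", lang,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", errors="replace"))
    return out.decode("utf-8", errors="replace")


async def ocr_many(paths, worker=ocr_image, max_workers: int = 4):
    """OCR many images concurrently, at most max_workers at a time."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(path):
        async with semaphore:
            return await worker(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(path) for path in paths))
```

Usage would look like `asyncio.run(ocr_many(["page1.png", "page2.png"]))`, with the image paths as placeholders. The `worker` parameter exists only so the concurrency logic can be exercised without Tesseract installed.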

OCR, though, is a complex world. If what you need is to extract text from standard text documents (broadly speaking), Tesseract, and hence Kreuzberg, is a good choice.

If what you need is things like layout extraction, handwriting recognition, complete bounding box metadata, etc., then you need to use an alternative, probably a commercial one.

4 comments

dleeftink 5 months ago

An oldie but goodie for layout extraction is CERMINE by Dominika Tkaczyk and colleagues[0]. Java required.

[0]: http://cermine.ceon.pl/about.html

mdaniel 5 months ago

Also AGPLv3: https://github.com/CeON/CERMINE/blob/cermine-parent-1.13/LIC...

nhirschfeld 5 months ago

Didn't know this!

ilaksh 5 months ago

PaddleOCR's layout analysis works, and so do some open-source vision-language models.