Comment by pseudony

5 months ago

Interesting, thanks for sharing :)

Can you speak to how this differs in PDF extraction from, say, PyMuPDF, pdfplumber, unsloth and so on?

I know the async part is probably a thing, but when building a RAG pipeline I would be brutally focused on the quality of text extraction. Have you noticed an ability to do better than the others?


nhirschfeld 5 months ago

So, for PDF we need to distinguish between two types of text extraction:

1. Text extraction from a searchable PDF.

2. OCR.

For 1. Kreuzberg uses pypdfium2, which is a Python binding for PDFium, the Chromium PDF engine. In this regard Kreuzberg has top-notch performance. Much faster than pdfminer.six, pdfplumber etc.

Note PyMuPDF has top-notch performance but also an AGPL license, which makes it almost unusable in closed-source projects without paying for a commercial license.

For 2. Kreuzberg uses Tesseract, which is very solid. Performance is good, and Kreuzberg utilizes async worker processes to optimize concurrency.

OCR though is a complex world. If what you need is to extract text from standard text documents (broadly speaking), Tesseract and hence Kreuzberg are a good choice.

If what you need is things like layout extraction, handwriting recognition, complete bounding box metadata etc., then you need to use an alternative, probably a commercial one.
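
To make the two paths above concrete, here is a minimal sketch of routing pages either to their existing text layer or to OCR running in a worker pool. It is not Kreuzberg's actual code: `needs_ocr`, `run_ocr`, `extract_pages` and the `min_chars` threshold are hypothetical names chosen for illustration, and `run_ocr` is a stand-in that would be replaced by a real Tesseract call.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor


def needs_ocr(text_layer: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a nearly empty text layer suggests a scanned page."""
    return len(text_layer.strip()) < min_chars


def run_ocr(page_id: int) -> str:
    """Stand-in for a CPU-bound OCR call; a real version would invoke Tesseract."""
    return f"<ocr text for page {page_id}>"


async def extract_pages(text_layers: list[str], pool: Executor) -> list[str]:
    """Keep usable text layers as-is and send the remaining pages to the pool."""
    loop = asyncio.get_running_loop()

    async def extract(page_id: int, layer: str) -> str:
        if not needs_ocr(layer):
            return layer  # path 1: searchable PDF, no OCR needed
        # path 2: OCR runs in the pool so the event loop is not blocked
        return await loop.run_in_executor(pool, run_ocr, page_id)

    tasks = (extract(i, layer) for i, layer in enumerate(text_layers))
    return list(await asyncio.gather(*tasks))


if __name__ == "__main__":
    pages = ["This page already has a searchable text layer.", ""]
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(asyncio.run(extract_pages(pages, pool)))
```

The pool is passed in as an argument so the same coroutine can run against worker processes for CPU-bound OCR, or against a thread pool when experimenting.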

dleeftink 5 months ago
An oldie but goodie for layout extraction is CERMINE by Dominika Tkaczyk and colleagues[0]. Java required.
[0]: http://cermine.ceon.pl/about.html
- mdaniel 5 months ago
  
  Also AGPLv3 https://github.com/CeON/CERMINE/blob/cermine-parent-1.13/LIC...
- nhirschfeld 5 months ago
  
  Didn't know this!
ilaksh 5 months ago

PaddleOCR layout works, and so do some open-source vision-language models.

tomcam 5 months ago

What is a RAG?

nhirschfeld 5 months ago
Retrieval Augmented Generation. It's a class of techniques for generating content using LLMs. I'd recommend Googling this.
- tomcam 5 months ago
  
  Was going to reply indignantly that it's hard to google rag and get that answer when I read your comment. Then I did, and it was the first result.
  Apologies!
  