Comment by whinvik

2 days ago

Curious, but how do we take care of non-text files? What if we had a lot of PDF files?

There are plenty of vision-capable embedding models, so you might not need to OCR at all, and doing so could either improve or hurt performance.

We OCR them with an LLM into markdown. Super expensive and slow but way more reliable than trying to decode insanely structured PDFs that users upload, which often include pages that are images of the text, or diagrams and figures that need to be read.

Really depends on your scale and speed requirements.

You can extract text from PDF files. (There are a number of dedicated models for that, but even a humble command-line tool like Poppler's pdftotext can do it. Note that pandoc can write PDF but cannot read it as input.)

Use pymupdf to extract the PDF text. Hell, run that nasty business through an LLM as step 2 to get a beautiful clean markdown version of the text. Lord knows the PDF format is horribly complex!
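
For what it's worth, here is a rough sketch of that first step in Python, plus a cheap check for the "pages that are just images of text" case mentioned upthread. `extract_pages` and `pages_needing_ocr` are names made up for illustration, and the `min_chars` threshold of 50 is an arbitrary guess, not a tuned value. The LLM cleanup step is left out on purpose, since it depends entirely on which model and API you use.

```python
def extract_pages(pdf_path):
    """Return one plain-text string per page of the PDF at pdf_path."""
    try:
        import pymupdf  # third-party package (pip install pymupdf)
    except ImportError:
        import fitz as pymupdf  # older PyMuPDF releases use this module name
    with pymupdf.open(pdf_path) as doc:
        # get_text() with no arguments returns the page's plain text
        return [page.get_text() for page in doc]


def pages_needing_ocr(page_texts, min_chars=50):
    """Return indices of pages whose extracted text is suspiciously short.

    A page that is a scanned image usually yields little or no text,
    so it is a candidate for OCR or a vision-capable model instead.
    min_chars is an illustrative threshold only.
    """
    return [
        i for i, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]
```

Something like `pages_needing_ocr(extract_pages("report.pdf"))` (with `report.pdf` standing in for your own file) would then give you the page indices worth sending down the slow, expensive path, while everything else goes straight to the text pipeline.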