Comment by whinvik

7 months ago

Curious, but how do we take care of non-text files? What if we had a lot of PDF files?

4 comments

whinvik

Use pymupdf to extract the PDF text. Hell, run that nasty business through an LLM as step-2 to get a beautiful clean markdown version of the text. Lord knows the PDF format is horribly complex!
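A minimal sketch of that first step, assuming PyMuPDF is installed (`pip install pymupdf`; older releases import as `fitz`). `extract_page_texts` and `join_pages` are hypothetical helper names, and the `[page N]` marker format is just an illustrative choice. The LLM cleanup pass is left out, since it depends on whichever model and client you use.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the plain text of each page, in order, using PyMuPDF."""
    # Imported lazily so the pure-Python helper below works without PyMuPDF.
    import pymupdf  # on older releases: import fitz as pymupdf

    doc = pymupdf.open(pdf_path)
    try:
        # get_text() with no arguments returns the page's plain text.
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def join_pages(page_texts: List[str]) -> str:
    """Join per-page text into one string with page markers.

    Trailing whitespace is trimmed from each line and pages with no
    text are skipped. Page numbers are 1-based and refer to the
    position in the original PDF.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        cleaned = "\n".join(line.rstrip() for line in text.strip().splitlines())
        if cleaned:
            parts.append(f"[page {number}]\n{cleaned}")
    return "\n\n".join(parts)
```

Something like `join_pages(extract_page_texts("report.pdf"))` (the filename is a placeholder) would give you one string to hand to the step-2 LLM for markdown cleanup.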

minimaxir 7 months ago

You can extract text from PDF files. (There are a number of dedicated models for that, but even the humble pandoc can do it.)

elliotto 7 months ago

We OCR them with an LLM into markdown. Super expensive and slow but way more reliable than trying to decode insanely structured PDFs that users upload, which often include pages that are images of the text, or diagrams and figures that need to be read.

Really depends on your scale and speed requirements.
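If cost is the concern, one possible variation (not necessarily what the commenter does) is to OCR only the pages that have little or no extractable text layer. A rough sketch, where `pages_needing_ocr` and the `min_chars` threshold are assumptions made up for illustration; the per-page text could come from any extractor:

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 50) -> List[int]:
    """Return 0-based indices of pages that look like scanned images.

    A page is flagged when its extracted text, ignoring whitespace,
    has fewer than `min_chars` characters. The threshold is a guess
    and would need tuning against real documents.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

Note that this heuristic misses pages that have a text layer but also carry diagrams or figures that need to be read, so it trades some reliability for speed.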

luke-stanley 7 months ago

There are plenty of vision-capable embedding models, so you might not need to OCR at all, and doing so could either improve or hurt performance.