Comment by eamag
7 days ago
Love the name!
OCR was discussed here lately several times (https://github.com/Future-House/paper-qa?tab=readme-ov-file#... are using PyMuPDF. My experience with Tesseract has been pretty disappointing: it's usually not good enough, and modern LLMs do better.
Thanks, I'll check these links.
In my tests I found Tesseract quite good for regular text documents. For other kinds of text it's not great.
As for using models - there are some good small language models as well, and of course LLMs.
I sorta feel though that if one needs complex OCR, or a vision model for layout, one should opt for either a commercial solution that abstracts away the deployment and GPU management, or bake one's own system.
For most use cases involving text documents though, my subjective opinion is that Tesseract is sufficient.
Can’t wait for non-Germans to butcher that name.