Comment by zihotki

4 months ago

According to what I read in the documentation, it uses Tesseract underneath. I've used Tesseract v3 in the past and it was a pain. Tesseract 4 uses an LSTM neural net. How good are the performance and recognition quality nowadays in v4? Could anyone share their experience?

4 comments

graynk 4 months ago

I use paperless-ngx to digitize all my documents; it also uses Tesseract. The results are not perfect, but more than acceptable if I scan at 600 dpi.
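
For anyone who wants to try a 600 dpi scan against the LSTM engine directly, here is a minimal sketch of calling the Tesseract 4+ command line from Python. It is only an illustration, not how paperless-ngx invokes Tesseract internally; the function names `build_tesseract_command` and `ocr_to_pdf` and the file names are hypothetical, and it assumes the `tesseract` binary and the `eng` language data are installed.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(
    image: Path, output_base: Path, dpi: int = 600, lang: str = "eng"
) -> list[str]:
    """Build a Tesseract 4+ command line that writes a searchable PDF.

    Hypothetical helper for illustration; only the CLI flags are Tesseract's own.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),      # Tesseract appends ".pdf" to this base name
        "--oem", "1",          # OCR engine mode 1: LSTM neural net only
        "--dpi", str(dpi),     # resolution hint, useful if the image lacks DPI metadata
        "-l", lang,            # language model to load
        "pdf",                 # output config: PDF with an invisible text layer
    ]


def ocr_to_pdf(image: Path, output_base: Path, dpi: int = 600, lang: str = "eng") -> Path:
    """Run Tesseract and return the path of the PDF it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, dpi, lang), check=True)
    return Path(str(output_base) + ".pdf")
```

Calling `ocr_to_pdf(Path("scan.png"), Path("scan"))` would then be expected to leave a `scan.pdf` whose text layer can be indexed for full-text search, assuming the scan itself is legible.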

oigursh 4 months ago
There's https://github.com/icereed/paperless-gpt as a plugin
- graynk 3 months ago
  
  I've found local LLMs not good enough for OCR (while being a lot more resource hungry), and I want to avoid OpenAI models for privacy reasons. Default Tesseract does the job for me, since my only requirement for the results is "I can easily find what I need with full-text search". I rarely need to actually copy the text from the resulting PDFs.

btian 4 months ago

It's fine for simple use cases, but far inferior to the likes of GPT, Gemini, or Mistral.