Comment by madisonmay

5 months ago

pypdfium2 is a great choice and a solid piece of software!

You might want to look into https://github.com/VikParuchuri/surya as an alternative to Tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding, it's free to use.

7 comments

pzo 5 months ago

This still seems to be GPL. Another OCR engine worth considering is EasyOCR [0] (Apache license). AFAIK there is no layout detection, but it does provide bounding boxes, supports many languages, and can detect text on real-world objects in images (signposts, etc.).

[0] https://github.com/JaidedAI/EasyOCR

nhirschfeld 5 months ago
Yup, EasyOCR is good.
My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.
It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.
- cdrini 5 months ago
  
  Where did you find benchmarks for OCR tools? There have been so many OCR engines coming out lately, I would love to see benchmarks!
  
- alex_suzuki 5 months ago
  
  Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR
  Personally, I've used Tesseract before, but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.
  

nhirschfeld 5 months ago

Interesting!