Comment by aidenn0

5 days ago

This is my first encounter with Scribe.js; since I have many book scans, I always try OCRing them when I see something like this. Compared to Tesseract (which is the best I have so far), it gets the words right slightly more often, but the paragraph segmentation is many times worse. On a book where every paragraph is indented, it reliably decides that two consecutive one-line paragraphs are the same paragraph, which is understandable, but a downgrade from Tesseract, which gets the paragraph segmentation as correct as possible. (It doesn't handle paragraphs that span page breaks, since I'm feeding it one page at a time.)

9 comments


zihotki 5 days ago

Scribe is Tesseract. It uses tesseract.js, which is a WebAssembly port of Tesseract, so in theory they should be equal. In practice, custom settings or older versions could make a difference.

aidenn0 4 days ago

This is only true in the "speed" mode; in the "quality" mode it claims better word recognition than Tesseract on clean scans (which matches my tests): https://github.com/scribeocr/scribe.js/blob/master/docs/scri...

criddell 4 days ago

What's the motivation for doing this in the browser? It seems like intentionally choosing a more difficult path to create an inferior result.

A native macOS or Windows application could use the OCR facilities of the operating system and, in my experience, both produce results that are far better than Tesseract.
- Zardoz84 4 days ago
  
  To generate the OCR on the fly, in the browser, when you do not have the proper OCR info. As someone who works on public web libraries, I find it useful (but wasteful).

Elucalidavah 5 days ago

> Tesseract (which is the best I have so far)

Have you looked at EasyOCR?

aidenn0 4 days ago

EasyOCR is significantly worse than Tesseract for clean printed text, while being orders of magnitude slower; it is far better than Tesseract for low-quality scans and for extracting text from pictures (e.g. comics), which Tesseract does not do as well.
- criddell 4 days ago
  
  Have you tried Abbyy FineReader? It's the best OCR package I've seen.
  