Comment by sgc
14 days ago
99%+ is terrible in the OCR world. 99.8%+ on the first pass, and 99.99%+ (one error per 10k characters) at the end of the process, which includes human reviewers in the loop, is OK, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.
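To put those percentages in perspective, here is a rough sketch that converts character accuracy into expected errors per page. The 2,000 characters-per-page figure and the `expected_errors` helper are my own assumptions for illustration, not a standard.

```python
# Rough illustration: convert character-level accuracy into expected errors.
# CHARS_PER_PAGE is an assumed figure for a dense page of text.
CHARS_PER_PAGE = 2000


def expected_errors(accuracy: float, chars: int) -> float:
    """Expected number of wrong characters in `chars` characters."""
    return (1.0 - accuracy) * chars


if __name__ == "__main__":
    levels = [
        ("99%", 0.99),
        ("99.8%", 0.998),
        ("99.99%", 0.9999),
        ("99.999%", 0.99999),  # "another 9"
    ]
    for label, acc in levels:
        per_page = expected_errors(acc, CHARS_PER_PAGE)
        print(f"{label:>8}: about {per_page:.2f} errors per page")
```

Under that assumption, 99% means roughly 20 wrong characters on every page, while 99.99% is closer to one error every five pages.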
Even with the best OCR and high-resolution scans, you might not get this due to:
- the quality of the original paper documents, and
- the language
I have non-English documents for which I'd love to have 99% accuracy!
Language is often solvable with better dictionaries. I have been forced to make my own dictionaries in the past, which led to error rates similar to those of more mainstream languages like English. If you are talking about another alphabet, like Cyrillic or Arabic, that is another problem.
Yup, I'm talking about another alphabet.