Comment by wrp

3 years ago

A while ago, I was helping a student collect 19th-century texts for corpus analysis. Since the books were out of copyright, PDFs were downloadable from Google and the Internet Archive. Although the scanned page images from the two sources were equivalent, the OCRed versions were very different. The OCRed texts from Google had a very low error rate and could be easily corrected by hand. The ones from IA were unusable, with many extreme typos on every line.