Comment by quuxplusone

1 year ago

Perhaps. My perhaps-curmudgeonly take on that is that it sounds a bit like "Xerox scanners/photocopiers randomly alter numbers in scanned documents" ( https://history.stackexchange.com/questions/50249/why-does-n...

Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.

quuxplusone

huijzer 1 year ago

Fair enough. Very valid points. I guess it boils down to “test both systems and see what works best for the task at hand”. I can indeed imagine cases where your approach would be the better option.