Comment by huijzer

1 year ago

Last time I checked a few months ago, LLMs were more accurate than the OCR that the archive is using. The web archive version is/was not using context to figure out that, for example, “in the garden was a trge” should be “in the garden was a tree”. LLMs, depending on the prompt, do this.



quuxplusone 1 year ago

Perhaps. My perhaps-curmudgeonly take on that is that it sounds a bit like "Xerox scanners/photocopiers randomly alter numbers in scanned documents" ( https://history.stackexchange.com/questions/50249/why-does-n...

Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.

huijzer 1 year ago

Fair enough. Very valid points. I guess it boils down to “test both systems and see what works best for the task at hand”. I can indeed imagine cases where your approach would be the better option.