Comment by cvz
6 days ago
This is moot anyway if the LLM is only used as part of a review process. But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess. There's no way to get around that.
> This is moot anyway if the LLM is only used as part of a review process
Not really, because if you plan to perform a `diff` in order to ensure that there are no False Positives among the corrections proposed by your "assistant", you will want one that finds as many True Positives as possible (otherwise the exercise will be futile and inefficient, since a large number of OCR errors would remain). So it would be good to have some tool that could (in theory, at this stage) find the subtle errors as well (not dictionary-related, not local, etc.).
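To make the review step concrete, here is a minimal sketch in Python using the standard library's difflib; the two strings are made-up placeholders, not output from any particular OCR engine or LLM:

```python
import difflib

# Raw OCR output and the correction proposed by the LLM "assistant"
# (both strings are invented here purely for illustration).
ocr_text = "The comittee met in 194S to review the pian."
llm_text = "The committee met in 1945 to review the plan."

# Word-level diff: every proposed change surfaces as a -/+ pair,
# so a reviewer can accept or reject each correction individually.
for token in difflib.ndiff(ocr_text.split(), llm_text.split()):
    if token.startswith(("-", "+")):
        print(token)
```

The same idea scales to whole pages: anything the assistant changed shows up in the diff, and anything it silently missed is exactly why you want a corrector with as many True Positives as possible.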
> But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess
It depends on the text. You may be interested, e.g., in texts whose value lies in the arguments, in the development of thought (they will speak about known things in a novel way) - where an OCR error has a much reduced scope (and yet you want them clean of local noise). And if a report comes out from Gallup or similar, with original figures from recent research, we can hope it will be distributed first and foremost electronically. Potential helpers today can do more than those of only a few years ago (e.g. hunspell) could.
I tried it! (How absent-minded of us not to have tried it immediately.) And it works, mostly... I used a public LLM.
The sentence:
> ABG News was founded in 194S after action from the PCC , _ , demanciing pluralist progress 8 vears ear!ier
is corrected as follows, initially:
> ABG News was founded in 1945 after action from the FCC, demanding pluralist progress 8 years earlier
and as you can see, it already corrects a number of trivial and non-trivial OCR errors, including recognizing (explicitly in the full response) that it must be the "FCC" that is "demanding pluralist progress" (not to mention all the wrong characters breaking otherwise unambiguous words, and the stray punctuation).
After a second request, asking it to review its output and "see if the work is complete", it sees that 'ABG', which it was reluctant to correct because of ambiguities ("Australian Broadcasting Corporation", "Autonomous Bougainville Government", etc.), should actually be 'ABC', given several hints from the rest of the sentence.
As you can see - proof of concept - it works. The one error it (the one I tried) cannot see is the '8' that should be '3': it knows (it says so in its full response) that the FCC acted in 1942 and that ABC was created in 1945, but it does not compute the difference internally (1945 - 1942 = 3, so the text should read "3 years earlier"), nor does it catch the hint that '8' and '3' are graphically close. Maybe an LLM with "explicit reasoning" could do even better and catch that one too.
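For contrast with a dictionary-style helper of the hunspell kind mentioned above, here is a toy check on the same garbled line; the word list is a made-up stand-in for a real dictionary, so this is only a sketch of the behaviour:

```python
# Toy stand-in for a hunspell-style dictionary check. Only non-words get
# flagged: "PCC" looks like a valid acronym and "8" is a valid digit,
# so the contextual errors slip through untouched.
KNOWN = {"news", "was", "founded", "in", "after", "action", "from", "the",
         "demanding", "pluralist", "progress", "years", "earlier",
         "abc", "fcc", "pcc"}  # acronyms like PCC often sit in real dictionaries

ocr_line = ("ABG News was founded in 194S after action from the PCC , _ , "
            "demanciing pluralist progress 8 vears ear!ier")

for word in ocr_line.split():
    bare = word.strip(",._!").lower()
    if bare and not bare.isdigit() and bare not in KNOWN:
        print("flagged:", word)
```

It flags 'ABG', '194S', 'demanciing', 'vears' and 'ear!ier', but stays silent on 'PCC', '8' and the stray punctuation - exactly the subtle, non-local errors the LLM turned out to be useful for.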
I'm saying it's moot because, if you're just flagging things for review, there's already a more direct and reliable way to do that. The OCR classifier itself outputs a confidence score. The naive approach of just checking that confidence score will work. The OCR classifier has less overall information than an LLM, but the information it has is much more relevant to the task it's doing.
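A minimal sketch of that confidence-based flagging, assuming Tesseract via pytesseract (the filename and the threshold of 60 are placeholder choices, and the exact types returned in the 'conf' field vary a bit between versions):

```python
import pytesseract
from PIL import Image

# Per-word confidence scores straight from the OCR engine itself.
data = pytesseract.image_to_data(Image.open("scan.png"),
                                 output_type=pytesseract.Output.DICT)

for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)                   # some versions return strings
    if word.strip() and 0 <= conf < 60:  # conf == -1 marks non-word boxes
        print(f"review: {word!r} (confidence {conf:.0f})")
```

No language model involved: the engine that actually looked at the pixels tells you which words it was unsure about.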
When I have some time in front of a computer, I'll try a side-by-side comparison with some actual images.