Comment by mdp2021
6 days ago
> This is moot anyway if the LLM is only used as part of a review process
Not really, because if you plan to perform a `diff` in order to ensure that there are no False Positives among the corrections proposed by your "assistant", you will want one that finds as many True Positives as possible (otherwise the exercise will be futile and inefficient, if a large number of OCR errors remain). So it would be good to have some tool that could (in theory, at this stage) find the subtle errors as well (not dictionary-related, not local, etc.).
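A minimal sketch of that diff-based review step, assuming the assistant returns corrected plain text (the sample strings and variable names here are hypothetical):

```python
import difflib

# Hypothetical inputs: raw OCR output vs. the assistant's proposed fix.
ocr_text = "Tbe quick brown fox jumps ovcr the lazy dog."
proposed = "The quick brown fox jumps over the lazy dog."

# Word-level diff: each non-equal opcode is one proposed correction
# that a human reviewer can accept or reject (guarding against
# False Positives introduced by the assistant).
a, b = ocr_text.split(), proposed.split()
matcher = difflib.SequenceMatcher(None, a, b)
changes = [
    (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(changes)
```

Note that such a review catches only the changes the assistant made; errors it never flagged (the missed True Positives) are invisible in the diff, which is exactly why recall matters.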
> But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess
It depends on the text. You may, for example, be interested in texts whose value lies in their arguments, in the development of thought (they speak about known things in a novel way), where an OCR error has a much reduced scope (and yet you want them clean of local noise). And if a report comes out from Gallup or similar, with original figures from recent research, we can hope that it will be distributed foremost electronically. Potential helpers today can do more than those of only a few years ago (e.g. hunspell) could.