
Comment by briga

5 days ago

I was actually just working on a project like this to digitize an old manuscript. I used a PDF scanning app (there are plenty; I used NAPS2, which is simple but works), then piped the page images into `tesseract-ocr`. That extracts the text from the images, but it won't deal with formatting or obvious typos. For those you'll want to feed the text into an LLM with a prompt telling the model to correct errors, fix the formatting, and return clean text. Smaller local models (under 70B parameters) don't handle this well on big documents, but I found ChatGPT's reasoning model does a fine job. My goal now is to find a model that can run locally with similar performance.
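For the OCR and prompt steps, a minimal sketch of what I mean (assuming your scanner app saves pages as PNGs and the stock `tesseract` CLI is on your PATH; the function names are my own, and the actual LLM call is left out since it depends on which provider or local runtime you use):

```python
import subprocess
from pathlib import Path


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """OCR one scanned page with the tesseract CLI ("-" writes text to stdout)."""
    result = subprocess.run(
        ["tesseract", image_path, "-", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages(image_dir: str) -> str:
    """Concatenate the OCR text of every PNG page in a directory, in filename order."""
    pages = sorted(Path(image_dir).glob("*.png"))
    return "\n\n".join(ocr_image(str(p)) for p in pages)


def build_cleanup_prompt(raw_text: str) -> str:
    """Wrap the raw OCR output in a correction prompt for whatever LLM you use."""
    return (
        "The following text was extracted by OCR from a scanned manuscript. "
        "Correct OCR errors and obvious typos, fix the formatting, and return "
        "only the cleaned text:\n\n" + raw_text
    )
```

Then you'd pass `build_cleanup_prompt(ocr_pages("scans/"))` to the model, chunking it if the document is too long for the context window.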