Comment by __rito__

5 months ago

I have tried a bunch of things. This is what worked best for me: Surya [0]. It can run fully locally on your laptop. I also tried EasyOCR [1], which is also quite good. I haven't tried it myself, but I would look at Paddle [2] if the previous two don't float your boat.

All of these are OSS, and you don't need to pay a dime to anyone.

[0]: https://github.com/VikParuchuri/surya

[1]: https://github.com/JaidedAI/EasyOCR

[2]: https://github.com/PaddlePaddle/Paddle
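For anyone who wants to try the local route, here is a minimal sketch of running EasyOCR [1] on one scanned page. The helper names (`ocr_page`, `page_to_text`) and the file name `page_001.png` are made up for illustration; only `easyocr.Reader` and `readtext` come from the library. It assumes `pip install easyocr` and that the page is already saved as an image.

```python
def ocr_page(image_path, languages=("en",)):
    """Run EasyOCR on one page image and return the recognized text lines.

    Hypothetical helper: easyocr is imported lazily so the rest of this
    sketch can be used without the package installed.
    """
    import easyocr

    # gpu=False keeps everything on the CPU, which is enough for a laptop.
    reader = easyocr.Reader(list(languages), gpu=False)
    # detail=0 returns plain strings instead of (box, text, confidence) tuples.
    return reader.readtext(str(image_path), detail=0)


def page_to_text(lines):
    """Join recognized lines into one text block, dropping empty ones."""
    return "\n".join(line.strip() for line in lines if line.strip())


if __name__ == "__main__":
    # "page_001.png" is a placeholder path, not a file from this thread.
    print(page_to_text(ocr_page("page_001.png")))
```

For a whole book you would probably create the `Reader` once and reuse it for every page, since building it is the slow part.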

4 comments


pmarreck 5 months ago

Got some questions (sorry for necro, but I only discovered this thread by accident because I left it open in a tab and it turns out to be super-relevant to me):

I have some out-of-print books that I want to convert into nice PDFs/EPUBs (like, reference-quality).

1) I don't mind destroying the binding to get the best quality. Any idea how I do so?

2) I have a multipage double-sided scanner (Fujitsu ScanSnap). Would this be sufficient to do the scan portion?

3) Is there anything that determines the font of the book text and reproduces it somehow? And that deals with things like bold and italic, and applies them as Markdown output or what have you?

4) How do you de-paginate the raw text so it reflows into (say) an EPUB or PDF format that paginates based on the output device's (page size/layout) specification?

__rito__ 5 months ago

Hey there. Honestly, I don't know the answers to most of your questions.

2. I think it would be enough. People do great work with much less.

3. I think Surya would handle it, though I have mostly done flat text. I would also try LLM-based OCR, such as Google Gemini 2.0 Flash, with different pipelines and different system prompts. I have yet to do this myself, but it would be easy to check. About fonts: I never really worried about them. If it's something fancy, and you are crazy enough, you could create a font. Or you could use a handwriting-mimicry tool built on another AI model; I don't have a name off the top of my head. Look through OCR models. Indian college and high-school kids still have to submit handwritten projects and assignments. Some crafty kids use such tools to type (or copy-paste from ChatGPT) and then print in pen-ink color in their own handwriting, fooling the teacher, who has a large number of assignments to check.

4. I am not sure I fully understand the question. Do you mean that the book's page numbers will be read as book text in your OCRed data? If so, I just used good old-fashioned regex to root the page numbers out. Once you have the full text without page numbers, there are multiple tools for creating EPUBs and PDFs. If you already have an EPUB or PDF, you can also reformat it for the target device using just Calibre.

1. I don't understand the question. Do you mean some kind of scan other than regular scanning? I don't know at all. I just work with regularly scanned documents.
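The regex cleanup from point 4 can be sketched roughly like this. It is not the exact code I used, just an illustration: the function names and the page-number pattern are assumptions, and it expects OCR output where a page number sits alone on its own line.

```python
import re

# Assumed pattern: a line containing only a number, optionally prefixed
# with "page" (e.g. "12" or "Page 12"). Adjust it to your book's layout.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def strip_page_numbers(text):
    """Drop lines that look like standalone page numbers."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


def reflow(text):
    """Turn hard-wrapped OCR lines into one line per paragraph."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; newlines inside one become spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


if __name__ == "__main__":
    sample = (
        "The quick brown fox jum-\n"
        "ped over the lazy dog.\n"
        "\n"
        "12\n"
        "\n"
        "A new paragraph\n"
        "starts here.\n"
    )
    print(reflow(strip_page_numbers(sample)))
```

Two caveats: a paragraph that continues across a page break still ends up split in two, and the hyphen rule will also glue together real hyphenated compounds that happen to break at a line end. The reflowed text can then go into whatever EPUB/PDF tool you like, Calibre included.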

carlosjobim 5 months ago

I would like to pay a dime, and more, for any of the solutions discussed in this thread as a normal macOS program with a graphical user interface.

pmarreck 5 months ago

Wow, Surya looks legit! https://www.datalab.to/
