Comment by pronoiac

1 year ago

I made a high-quality scan of PAIP (Paradigms of Artificial Intelligence Programming), and I've been OCR'ing it and incorporating the results into an admittedly imperfect git repo of Markdown files. I used Scantailor to deskew and make other adjustments before applying Tesseract, via OCRmyPDF. I wrote notes for some of my process over at https://news.ycombinator.com/item?id=43043671 - OCR4all

(Meaning, I have these browser tabs open, I haven't fully digested them yet)
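For anyone who wants to try the Tesseract-via-OCRmyPDF step, here is a rough sketch of how it could be scripted. It is not the exact command used for PAIP. The file names and the helper functions `build_ocrmypdf_command` and `run_ocr` are made up for illustration, and it assumes the `ocrmypdf` CLI is installed and the pages were already deskewed in Scantailor (so no deskew flag is passed). The `--sidecar` option writes the recognized text to a plain-text file, which is a convenient starting point for hand-editing into Markdown.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(scan_pdf, out_pdf, sidecar_txt, language="eng"):
    """Build the argument list for one OCRmyPDF run (nothing is executed here).

    All paths are hypothetical placeholders, not the files from the PAIP repo.
    """
    return [
        "ocrmypdf",
        "--language", language,          # Tesseract language pack to use
        "--sidecar", str(sidecar_txt),   # also write the OCR text to a .txt file
        str(scan_pdf),                   # input: the cleaned-up scan
        str(out_pdf),                    # output: PDF with a text layer added
    ]


def run_ocr(scan_pdf, out_pdf, sidecar_txt, language="eng"):
    """Run OCRmyPDF and return the sidecar text; needs ocrmypdf on PATH."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    command = build_ocrmypdf_command(scan_pdf, out_pdf, sidecar_txt, language)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(sidecar_txt).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only print the command, so this is safe to run without OCRmyPDF installed.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan-ocr.pdf", "scan.txt")))
```

Calling `run_ocr("scan.pdf", "scan-ocr.pdf", "scan.txt")` would do the actual OCR pass. The text that comes back still needs proofreading against the printed page.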

3 comments

lherron 1 year ago

Also this:

https://news.ycombinator.com/item?id=42952605 - Ingesting PDFs and why Gemini 2.0 changes everything

kingkongjaffa 1 year ago

Was technology the right approach here? Is it essentially done now? I couldn’t tell if it was completed entirely.

I can’t help but think a few amateur humans could have read the pdf with their eyes and written the markdown by hand if the OCR was a little sketchy.

pronoiac 1 year ago

It's still in progress! It's looong - about a thousand pages. There's an ebook, but the printed book got more editing.