Comment by thenthenthen

9 days ago

I asked this question yesterday but it did not get enough votes. I need to OCR and then translate thousands of pages of historical documents, and was wondering if you know of a scriptable app, technique, or technology that includes 'layout recovery', i.e. overlaying the translated text on the original, like the Safari browser etc. does (not sure whether the Apple Vision framework wrapper does this?).

Apple Vision and its wrappers provide a bounding box for each line of text. That's slightly less convenient than Tesseract, which can give you a bounding box for each word, but it's more than compensated for by Apple Vision's better accuracy. I am planning to fudge the word boxes by assuming fixed-width letters and dividing up the line's overall width so that each word's width is proportional to its share of the total letters on the line.
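Here is a minimal sketch of that word-box fudge in Python. The function name `split_line_box` and the `(x, y, width, height)` box layout are hypothetical, and counting each inter-word space as one fixed-width character is my own assumption so that the boxes don't touch; none of this comes from a real OCR wrapper.

```python
def split_line_box(text, x, y, width, height):
    """Estimate per-word boxes from a single line-level bounding box.

    Assumes every character (including one space between words) has the
    same width, so each word's box is proportional to its letter count.
    Returns a list of (word, (x, y, width, height)) tuples.
    """
    words = text.split()
    if not words:
        return []

    # Total fixed-width cells on the line: letters plus single spaces.
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = width / total_chars

    boxes = []
    cursor = x
    for word in words:
        word_width = len(word) * char_width
        boxes.append((word, (cursor, y, word_width, height)))
        # Advance past the word and the one-character gap after it.
        cursor += word_width + char_width
    return boxes
```

With a proportional font the estimates will drift within the line, but the line's left and right edges still match the box the OCR engine reported.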

Once you have those bounding boxes, it's pretty simple to use a library like [1] (Python) or [2] (JavaScript) to add overlay text in the right place. For example, see how [3] does it.

[1] https://pymupdf.readthedocs.io/en/latest/recipes-text.html#h...

[2] https://github.com/foliojs/pdfkit

[3] https://github.com/eloops/hocr2pdf
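As a rough sketch of the overlay step with PyMuPDF: Apple Vision reports boxes normalized to 0..1 with the origin at the bottom-left, while PyMuPDF page coordinates start at the top-left, so the boxes need flipping first. The helper names, the white background fill, the shrink-to-fit loop, and the assumption that the OCR'd image covers the whole PDF page are all mine, not taken from any of the linked projects. `import pymupdf` needs a recent PyMuPDF release (older ones use `import fitz`).

```python
def vision_box_to_pdf_rect(box, page_width, page_height):
    """Convert a normalized, bottom-left-origin box (x, y, w, h) into
    (x0, y0, x1, y1) in page units with a top-left origin."""
    x, y, w, h = box
    x0 = x * page_width
    x1 = (x + w) * page_width
    y0 = (1 - y - h) * page_height  # top edge after flipping the y axis
    y1 = (1 - y) * page_height      # bottom edge after flipping the y axis
    return (x0, y0, x1, y1)


def overlay_translation(pdf_in, pdf_out, page_number, lines):
    """Cover each original line with white and write its translation on top.

    `lines` is a list of (normalized_box, translated_text) pairs.
    """
    import pymupdf  # imported here so the helper above has no dependencies

    doc = pymupdf.open(pdf_in)
    page = doc[page_number]
    for box, text in lines:
        rect = pymupdf.Rect(
            *vision_box_to_pdf_rect(box, page.rect.width, page.rect.height)
        )
        # Hide the original text behind a white rectangle.
        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
        # insert_textbox returns a negative number (and writes nothing)
        # when the text does not fit, so shrink the font until it does.
        size = rect.height * 0.8
        while size > 4 and page.insert_textbox(rect, text, fontsize=size) < 0:
            size -= 1
    doc.save(pdf_out)
```

The default built-in font only covers Latin text, so a target language with another script would need a `fontfile` passed to `insert_textbox`.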

FYI, Apple's OCR is at its best inside the Live Text API, which is Swift-only, so some older Python and CLI tools that wrap the older Obj-C APIs may give worse quality. Live Text doesn't really provide bounding boxes, though, so what I do is combine its output with the bounding-box APIs, like the older iOS/macOS ones.