Comment by layer8

1 year ago

Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, sidebars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.

visarga 1 year ago

OCR is not 100% accurate either. Reading order is also fragile: it might recognize each word correctly but still mess up the line structure.
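
To make the reading-order point concrete, here is a minimal sketch, not tied to any particular OCR engine. The `Word` class, the coordinates, and the `y_tolerance` value are all made up for illustration. It assumes the OCR step returns each word with the top-left corner of its bounding box, and that words on the same printed line differ slightly in `y` because of skew or baseline jitter.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box (hypothetical units)
    y: float  # top edge of the bounding box


def naive_order(words):
    """Sort strictly by (y, x); small vertical jitter scrambles a line."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def line_grouped_order(words, y_tolerance=5.0):
    """Group words whose y is close to the first word of the current line,
    then sort each line left to right."""
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        if lines and abs(w.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]


# Two printed lines; every word is recognized correctly, but the y values jitter.
words = [
    Word("Reading", 10, 100.0),
    Word("order", 80, 98.5),
    Word("is", 130, 101.2),
    Word("fragile", 10, 120.3),
    Word("here", 80, 119.0),
]

print(naive_order(words))         # words are right, line structure is not
print(line_grouped_order(words))  # lines recovered with a tolerance band
```

The tolerance band is itself a heuristic: it would still fail on multi-column pages, sidebars, or heavily skewed scans, which is the kind of fragility described above.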