Comment by cratermoon

17 hours ago

There are PDF files and there are PDF files. Many (most?) PDFs I run into are generated from Microsoft Word or some other MS product with no structure at all. The majority of people who use MS products don't understand or care about structure. The WYSIWYG imperative means lots of markup to describe font size, color, and decoration, to make every section heading look the same without ever designating the text as a section head. The same happens with paragraphs, page breaks, and column flow. The resulting document looks correct enough to the creator. Other people, who have a different version of Word, different fonts, and a thousand other little differences, won't see it correctly. That leads our author to generate a PDF, probably with embedded fonts, to ensure uniform appearance across these thousand little exceptions.

The result is a document with the content mixed up so incomprehensibly with appearance controls as to be both unreadable and without any residue of the underlying intended structure of the document's sections, headers, figures, paragraphs, captions, footnotes, or anything.
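To make that concrete, here is a rough sketch of what a recovery tool has to work with. Everything in it is my own assumption for illustration: the `Span` type, the sample data, and the "bigger than body text means heading" rule are all hypothetical, not the output or API of any real PDF library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical run of text with only visual attributes, no semantic role."""
    text: str
    font_size: float
    bold: bool


def guess_headings(spans):
    """Guess headings as any span set larger than the dominant body size."""
    # Treat the font size that covers the most characters as body text.
    weight = Counter()
    for s in spans:
        weight[s.font_size] += len(s.text)
    body_size = weight.most_common(1)[0][0]
    return [s.text for s in spans if s.font_size > body_size]


# Made-up sample: nothing here says which spans are headings.
spans = [
    Span("Quarterly Report", 18.0, True),
    Span("Revenue grew in every region during the quarter.", 11.0, False),
    Span("WARNING", 14.0, True),   # emphasis, not a heading
    Span("Figures are unaudited and subject to change.", 11.0, False),
    Span("Outlook", 11.0, True),   # a real heading, styled only with bold
    Span("We expect similar growth next quarter.", 11.0, False),
]

headings = guess_headings(spans)
print(headings)
```

On this sample the guess picks up "WARNING", which is only emphasis, and misses "Outlook", which the author meant as a heading but styled with bold alone. A different rule would just fail on a different document, because the intent was never recorded.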

And then there are PDF files which are nothing more than a series of images of pages of text. If you're lucky and the scans are clean, a good OCR tool might be able to recover most of the content.

What I'm saying is that the tool doesn't matter if authors don't encode structure and formatting in semantically meaningful ways.

So what you are actually saying is that there is a market for a tool that will recreate the PDF with a structure based on how the original PDF looks?

  • The market has been needing a tool like that for 30 years. A PDF document of the type I describe is like a broken egg. Information is lost between the authoring and rendering, to the extent that it's not clear recreating the original is even possible.

    • A typesetter could recreate the document by looking at it, doing some font research, and playing with the kerning for a while. Saying it's not possible to recreate a typeset document that is readable is absurd, no matter how twisted and insane the actual PostScript is.