Comment by layer8

5 months ago

PDF does support incorporating information about the logical document structure, aka Tagged PDF. It’s optional, but recommended for accessibility (e.g. PDF/UA). See chapters 14.7–14.8 in [1]. Processing PDF files as rendered images, as suggested elsewhere in this thread, can actually dramatically lose information present in the PDF.

Alternatively, XML document formats and the like do exist. Indeed, HTML was supposed to be a document format. That’s not the problem. The problem is having people and systems actually author documents in that way in an unambiguous fashion, and having a uniform visual presentation for it that would be durable in the long term (decades at least).

PDF as a format persists because it supports virtually every feature under the sun (if authors care to use them), while largely guaranteeing a precisely defined visual presentation, and being one of the most stable formats.

[1] https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...

4 comments


zoogeny 5 months ago

I'm not suggesting we re-invent RDF or any other kind of semantic web idea. And the fact that semantic data can be stored in a PDF isn't really the problem being solved by tools such as these. In many cases, PDF is used for things like scanned documents where adding that kind of metadata can't really be done manually - in fact the kinds of tools suggested in the post would be useful for adding that metadata to the PDF after scanning (for example).

Imagine you went to a government office looking for some document from the 1930s, like an ancestor's marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that: JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
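To make the "zip up an HTML doc alongside the images" option concrete, here is a minimal Python sketch using only the standard library. The function name `bundle_scan`, the file names inside the archive, and the metadata fields are all made up for illustration. The layout is ad hoc, which is exactly the point: nothing standardizes it.

```python
import html
import json
import zipfile
from pathlib import Path


def bundle_scan(image_path, metadata, out_path):
    """Zip one scanned image with a JSON metadata file and an HTML index page.

    The archive layout (index.html, metadata.json, images/) is a convention
    invented for this sketch, not an established format.
    """
    image_path = Path(image_path)
    image_name = f"images/{image_path.name}"
    title = html.escape(str(metadata.get("title", image_path.name)))

    # A tiny HTML page that ties the image and its title together.
    index_html = (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset=\"utf-8\"><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n"
        f"<img src=\"{html.escape(image_name)}\" alt=\"{title}\">\n"
        "</body></html>\n"
    )

    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(image_path, arcname=image_name)
        zf.writestr("metadata.json", json.dumps(metadata, indent=2, ensure_ascii=False))
        zf.writestr("index.html", index_html)
    return out_path
```

Calling something like `bundle_scan("certificate.png", {"title": "Marriage certificate", "year": 1931}, "bundle.zip")` would produce a self-contained archive, but any reader of it has to know my made-up conventions to find the metadata.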

It is the second part of your post that stands out - the kitchen sink nature of PDFs is what makes them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.

groby_b 5 months ago
I mean, you want to store a kitchen sink of data, too. You don't like the semantic web or semantic metadata, fine - what do you propose? A custom metadata format for each use case? That is semantic information.
If you don't do that, you get a kitchen sink. If you need to store 1930s death certificates, 10-K filings, your doctor's signup forms, the ARR graph for your startup, and a genealogy chart all in the same format, kitchen sink it is.
If it were "just a wrapper for image data", what exactly would that wrapper add? Semantic information, or a kitchen sink to manage additional info.
You're asking to store complex data without preserving complexity - I don't think that'll work.
- zoogeny 5 months ago
  
  I understand your confusion.
  PDF is terrible because it has grown over time from a format that was originally made for one purpose into a format that is used for too many purposes. That organic growth has caused PDFs to be very difficult to use for a wide variety of use cases.
  That opinion implies almost none of the other positions you have claimed I support (and which I generally do not).

layer8 5 months ago

Fixed link: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...