Comment by zoogeny

17 days ago

I'm not suggesting we re-invent RDF or any other kind of semantic web idea. And the fact that semantic data can be stored in a PDF isn't really the problem being solved by tools such as these. In many cases, PDF is used for things like scanned documents where adding that kind of metadata can't really be done manually - in fact the kinds of tools suggested in the post would be useful for adding that metadata to the PDF after scanning (for example).

Imagine you went to a government office looking for some document from the 1930s, like an ancestor's marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that: JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
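
As a rough illustration of the ad hoc zip approach mentioned above (not a standard, and every name here, including `bundle_scan`, the archive layout, and the metadata keys, is made up for the example), a minimal Python sketch might look like this:

```python
import html
import io
import json
import zipfile


def bundle_scan(image_bytes: bytes, image_name: str, metadata: dict) -> bytes:
    """Pack one scanned image, its metadata, and an HTML index into a zip archive.

    The layout (images/, metadata.json, index.html) is a hypothetical convention,
    not an existing format.
    """
    title = html.escape(str(metadata.get("title", image_name)))
    index_html = (
        "<!doctype html>\n"
        f"<title>{title}</title>\n"
        f'<img src="images/{html.escape(image_name)}" alt="scanned page">\n'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The scan itself, stored unchanged.
        archive.writestr(f"images/{image_name}", image_bytes)
        # Free-form metadata; the keys are whatever the person scanning chooses.
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
        # A page that stitches the pieces together for viewing.
        archive.writestr("index.html", index_html)
    return buffer.getvalue()


# Placeholder bytes stand in for a real scan.
archive_bytes = bundle_scan(
    b"placeholder image bytes",
    "certificate.png",
    {"title": "Marriage certificate", "decade": "1930s"},
)
```

It works, but nothing outside this script knows what `metadata.json` means or where the images live, which is the gap a standard format would fill.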

It is the second part of your post that stands out - the kitchen sink nature of PDFs that makes them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.

I mean, you want to store a kitchen sink of data, too. You don't like the semantic web or semantic metadata, fine - what do you propose? A custom metadata format for each use case? That is semantic information.

If you don't do that, you get a kitchen sink. If you need to store 1930s death certificates, 10-K filings, your doctor's signup forms, the ARR graph for your startup, and a genealogy chart all in the same format, kitchen sink it is.

If it were "just a wrapper for image data", what exactly would that wrapper add? Semantic information, or a kitchen sink to manage additional info.

You're asking to store complex data without preserving complexity - I don't think that'll work.

  • I understand your confusion.

    PDF is terrible because it has grown over time from a format that was originally made for one purpose into a format that is used for too many purposes. That organic growth has caused PDFs to be very difficult to use for a wide variety of use cases.

    That opinion implies almost none of the other positions you have claimed I support (and which I generally do not).