Comment by zoogeny
17 days ago
Orthogonal to this post, but this just highlights the need for a more machine-readable PDF alternative.
I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs suffer the burden going forwards. But why not use that LLM coding brain power to create a better overall format?
I mean, do we really see printing things out onto paper as something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?
PDF does support incorporating information about the logical document structure, aka Tagged PDF. It’s optional, but recommended for accessibility (e.g. PDF/UA). See chapters 14.7–14.8 in [1]. Processing PDF files as rendered images, as suggested elsewhere in this thread, can actually dramatically lose information present in the PDF.
Alternatively, XML document formats and the like do exist. Indeed, HTML was supposed to be a document format. That’s not the problem. The problem is having people and systems actually author documents in that way in an unambiguous fashion, and having a uniform visual presentation for it that would be durable in the long term (decades at least).
PDF as a format persists because it supports virtually every feature under the sun (if authors care to use them), while largely guaranteeing a precisely defined visual presentation, and being one of the most stable formats.
[1] https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
I'm not suggesting we re-invent RDF or any other kind of semantic web idea. And the fact that semantic data can be stored in a PDF isn't really the problem being solved by tools such as these. In many cases, PDF is used for things like scanned documents where adding that kind of metadata can't really be done manually - in fact the kinds of tools suggested in the post would be useful for adding that metadata to the PDF after scanning (for example).
Imagine you went to a government office looking for some document from the 1930s, like an ancestor's marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that: JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
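To make the "zip up an HTML doc alongside a directory of images" idea concrete, here is a minimal sketch in Python using only the standard library. The function name `build_scan_bundle` and the layout (`index.html`, `metadata.json`, `images/`) are assumptions made up for illustration, not any existing standard, which is rather the point.

```python
import html
import io
import json
import zipfile


def build_scan_bundle(images, metadata):
    """Pack page scans, a JSON metadata sidecar, and an HTML index into one zip.

    `images` maps a file name to raw image bytes; `metadata` is any
    JSON-serializable dict. The layout is a hypothetical convention.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Metadata lives in a plain sidecar file, separate from the pixels.
        bundle.writestr("metadata.json", json.dumps(metadata, indent=2, sort_keys=True))

        # Store pages in sorted name order so the archive is deterministic.
        names = sorted(images)
        for name in names:
            bundle.writestr("images/" + name, images[name])

        # A tiny HTML page that stitches the scans together for viewing.
        title = html.escape(str(metadata.get("title", "Scanned document")))
        pages = "\n".join(
            '<img src="images/{}" alt="Page {}">'.format(html.escape(name), number)
            for number, name in enumerate(names, start=1)
        )
        index = (
            "<!DOCTYPE html>\n"
            '<html><head><meta charset="utf-8"><title>' + title + "</title></head>\n"
            "<body>\n" + pages + "\n</body></html>\n"
        )
        bundle.writestr("index.html", index)
    return buffer.getvalue()
```

Calling `build_scan_bundle({"page-001.png": png_bytes}, {"title": "..."})` returns the bytes of a zip you could write to disk. It works, but every reader still has to know this particular layout.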
It is the second part of your post that stands out - the kitchen sink nature of PDFs that makes them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.
I mean, you want to store a kitchen sink of data, too. You don't like the semantic web or semantic metadata, fine - what do you propose? A custom metadata format for each use case? That is semantic information.
If you don't do that, you get a kitchen sink. If you need to store 1930s death certificates, 10-K filings, your doctor's signup forms, the ARR graph for your startup, and a genealogy chart all in the same format, kitchen sink it is.
If it were "just a wrapper for image data", what exactly would that wrapper add? Semantic information, or a kitchen sink to manage additional info.
You're asking to store complex data without preserving complexity - I don't think that'll work.
Fixed link: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...