
Comment by layer8 · 1 day ago

I don’t think this was particularly modeled on MS Word. The incremental update feature was introduced with PDF 1.2 in 1996. It lets you quickly save changes without having to rewrite the whole file, for example when annotating a PDF.

Incremental updates are also essential for PDF signatures: when you add a subsequent signature to a PDF, you can't rewrite the file without breaking the previous signatures. Hence signatures are appended as incremental updates.
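
For the curious, this is roughly what an incremental update looks like on disk. A minimal sketch in Python, not a real PDF library: the function and its arguments are made up, and it assumes a classic xref table, a generation-0 object, and that you've already read the previous startxref offset, the /Size, and the /Root reference out of the existing trailer.

  def append_incremental_update(f, obj_num, new_obj, size, prev_xref, root_ref):
      # Illustrative sketch. f is the PDF opened "r+b"; everything below is
      # append-only, so the original bytes (and any signature computed over
      # them) stay intact.
      f.seek(0, 2)
      obj_offset = f.tell()
      # 1. The new version of the changed object, reusing its object number.
      f.write(b"%d 0 obj\n" % obj_num + new_obj + b"\nendobj\n")
      # 2. A cross-reference section listing only the changed object
      #    (entries are exactly 20 bytes: 10-digit offset, 5-digit generation).
      xref_offset = f.tell()
      f.write(b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_offset))
      # 3. A trailer whose /Prev chains back to the previous xref table, so
      #    unchanged objects still resolve through the old index.
      f.write(b"trailer\n<< /Size %d /Root %s /Prev %d >>\n"
              % (size, root_ref, prev_xref))
      f.write(b"startxref\n%d\n%%%%EOF\n" % xref_offset)

Nothing before the appended bytes changes, which is exactly why a signature over the original byte range stays valid.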

PDF files are for storing fixed (!!) output of printed/printable material. That's where the format's roots are, via PostScript; it's where the format found its main success in document storage; and it's the metaphor everyone has in mind when using the format.

PDFs don't change. PDFs are what they look like.

Except they aren't, because Adobe wanted to be able to (ahem) "annotate" them, or "save changes" to them. And Adobe wanted this because they wanted to sell Acrobat to people who would otherwise be using MS Word for these purposes.

And in so doing, Adobe broke the fundamental design paradigm of the format. And that has had (and continues to have, to hilarious effect) continuing security impact for the data that gets stored in this terrible format.

  • When Acrobat came out, cross-platform software was not common. Being able to publish a document that could be opened on multiple platforms was a big advantage. I was using it to distribute technical specifications in the mid-'90s. Different pages of these specifications came from FileMaker, Excel, Word, MiniCad, Photoshop, Illustrator, and probably other applications as well. We would combine these into a single PDF file. This simplified version control. This also meant that bidders could not edit the specifications.

    None of that could be accomplished with Word alone. I think you are underestimating the qualities of PDF for distribution of complex documents.

    • > This also meant that bidders could not edit the specifications.

      But they can! That's the bug, PDF is a mutable file format owing to Adobe's muckery. And you made the same mistake that every government redactor and censor (up to and including the ?!@$! NSA per the linked article) has in the intervening decades.

      The file format you thought you were using was a great fit for your problem, and better than MS Word. The software Adobe shipped was, in fact, something else.

  • It started in the '80s. PostScript was the big deal. It was a printer language, not a document language. It was not limited to “(mostly) text documents”, though it supported complex vector fonts and even hinting. For example, you could print high-quality vector graphs at native printer resolution from systems which would never ever get enough memory to rasterise such giant bitmaps, by writing/exporting to PostScript. That's where Adobe's business was. See also NeWS and NeXT.

    However, arbitrary non-trivial PostScript files were of little use to people without a hardware or software rasteriser (and sometimes fonts matching the ones the author had, and sometimes the specific brand of RIP matching the quirks of the authoring software, etc.), so it was generally used by people in or near publishing. PDF was an attempt to make a document distribution format that was more suitable for ordinary people and ordinary hardware (remember the non-workstation screen resolutions at the time). I doubt that anyone imagined typical home users writing letters and bulletins in Acrobat, of all things (though it does happen). It would be like buying Photoshop to resize images (and waiting for it to load each time). So a competitor to Word it was not. And vice versa: a Word file was never considered a format suitable for printing. The more complex the layout and the embedded objects, the less likely it would render properly on the publisher's system (if Microsoft Office existed for that architecture at all). Moreover, it lacked some features that were essential for even small-scale book publishing.

    Append-only or versioned-index, chunk-based file formats for things we consider trivial plain data today were common at the time. Files could be too big to rewrite completely on each save even without edits, just because of disk throughput and size limits. The system might not be able to load all of the data into memory because of addressing or size limitations (especially when we talk about illustrations at resolutions suitable for printing). Just like modern games only load the objects in the player's vicinity instead of copying all of the dozens or hundreds of gigabytes into memory, document viewers had to load only the objects in the area visible on screen. Change the page or zoom level, and wait until everything reloads from disk once again. Web browsers, for example, handle web pages of any length in the same fashion. I should also remind you that the default editing mode in Word itself in the '90s was not WYSIWYG, for similar performance reasons.

    If you look at the PDF object tree, you can see that some properties are set on the level above the data object, which allows overwriting a small part of the index with the next version to change, say, a position, without ever touching the chunk in which the big data itself stays (because appending a new version of that chunk, while possible, would increase the file size much more).
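
    As a toy illustration of that versioned-index idea (plain Python, nothing PDF-specific; all names here are made up): the bulky chunk is written once, and changing a property only appends a new, tiny index record pointing at the same chunk offset.

      import json, os

      def append_chunk(f, payload):
          # The big data is written exactly once; return where it lives.
          off = f.seek(0, os.SEEK_END)
          f.write(payload)
          return off, len(payload)

      def append_index(f, entries):
          # A new index version is a few dozen bytes, regardless of chunk
          # size. (A real format would also record the offset of the newest
          # index at the file tail, the way PDF's startxref does.)
          rec = json.dumps(entries).encode()
          f.seek(0, os.SEEK_END)
          f.write(b"IDX" + len(rec).to_bytes(4, "big") + rec)

      # First save:  off, n = append_chunk(f, image_bytes)
      #              append_index(f, {"img": {"at": off, "len": n, "pos": [0, 0]}})
      # Moving the image later appends only a new tiny index:
      #              append_index(f, {"img": {"at": off, "len": n, "pos": [100, 200]}})
      # while the megabytes of image bytes are never rewritten.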

    Document redraw speed can be seen in this random video. But that's 1999, and they probably got a really well-performing system to record the promotional content. https://www.youtube.com/watch?v=Pv6fZnQ_ExU

    PDF is a terrible format not because of that, but because its “standard” retroactively defined everything from the point of view of an Acrobat developer and skipped all the corner cases and ramifications (because if you are an Acrobat developer, you define what is a corner case and what is not). As a consequence, unless you are in a closed environment you control, the only practical validator for arbitrary PDFs is Acrobat (I don't think that happened by chance). The external client is always going to say “But it looks just fine on my screen”.

I'm pretty sure you can change various file formats without rewriting the entire file and without using "incremental updates".

  • You can’t insert data into the middle of a file (or remove portions from the middle of a file) without either rewriting it completely, or at least rewriting everything after the insertion point; the latter requires holding everything after the insertion point in memory (or writing it out to another file first, then reading it in and writing it out again).

    PDF is designed to not require holding the complete file in memory. (PDF viewers can display PDFs larger than available memory, as long as the currently displayed page and its associated metadata fit in memory. Similar for editing.)
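
    A rough sketch of how that lookup works, assuming a well-formed classic xref table (no xref streams, ignoring the /Prev chain from incremental updates, and a file longer than a few hundred bytes); the function name is made up:

      def find_object_offset(path, obj_num):
          # Illustrative sketch: find one object's byte offset without
          # reading the whole file.
          with open(path, "rb") as f:
              f.seek(-256, 2)                    # the tail holds "startxref <offset>"
              tail = f.read()
              xref_pos = int(tail.split(b"startxref")[-1].split()[0])
              f.seek(xref_pos)
              assert f.readline().strip() == b"xref"
              while True:
                  header = f.readline().split()  # subsection header: "<first> <count>"
                  if not header or header[0] == b"trailer":
                      return None
                  first, count = int(header[0]), int(header[1])
                  entries = f.read(20 * count)   # fixed-width 20-byte entries
                  if first <= obj_num < first + count:
                      entry = entries[20 * (obj_num - first):][:20]
                      return int(entry[:10])     # 10-digit byte offset of the object

    The viewer then seeks straight to that offset and parses just the one object (and whatever it references), never the whole file.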

    • While tedious, you can do the rewrite block-wise from the insertion point and only store an additional block's worth of the rest (or twice as much as you inserted).

      ABCDE, to insert 1 after C: save D, overwrite it with 1; save E, overwrite it with D; then append E at the end.
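
      A minimal sketch of that in Python (the function and variable names are made up; the "block" is simply the size of the inserted data):

        def insert_inplace(path, pos, data):
            block = len(data)              # everything after pos shifts by this much
            with open(path, "r+b") as f:
                f.seek(pos)
                carry = f.read(block)      # bytes displaced by the insert
                f.seek(pos)
                f.write(data)
                cursor = pos + block
                while carry:
                    f.seek(cursor)
                    nxt = f.read(block)    # the next block about to be displaced
                    f.seek(cursor)
                    f.write(carry)         # put the previous one back, shifted
                    carry = nxt
                    cursor += block

      Only two blocks (carry and nxt) are ever held in memory, matching the "twice as much as you inserted" bound, though every byte after the insertion point is still rewritten once.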

  • No, if you are going to change the structure of a structured document that has been saved to disk, your options are:

    1) Rewrite the file to disk
    2) Append the new data/metadata to the end of the existing file

    I suppose you could pre-pad documents with empty blocks and then go modify those in situ by binary editing the file, but that sounds like a nightmare.

  • This was 1996. A typical computer had tens of megabytes of memory, with disk throughput a fraction of what we have today. Appending an element, instead of reading, parsing, inserting into, and validating the entire document, is a better solution in so many ways. That people doing redactions don't understand the technology is a separate problem. The context matters.