Comment by pfisherman

1 day ago

Can someone spell out how this is possible? Do PDFs store a complete document version history? Do they store diffs in the metadata? Does this happen each time the document is edited?

You can replace objects in PDF documents. A PDF is mostly just a bunch of objects of different types so the readers know what to do with them. Each object has a numbered ID. I recommend mutool for decompressing the PDF so you can read it in a text editor:

    mutool clean -d in.pdf out.pdf

If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).

    1 0 obj
    <<
      /Type /Pages
      /Count 1
      /Kids [ 2 0 R ]
    >>
    endobj

    2 0 obj
    <<
      /Type /Page
      /Contents 5 0 R
      ...
    >>
    endobj
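To make the object/reference structure above concrete, here is a minimal Python sketch that lists each object and the objects it points at. It assumes the file has already been decompressed (for example with the mutool command above) and is simple enough that a plain regex scan is safe; real PDFs with binary streams would need a proper parser. `list_objects` and `SAMPLE` are names made up for this illustration, and the `...` line of the Page object is left out of the sample.

```python
import re

# Matches an object definition such as "1 0 obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Matches an indirect reference such as "2 0 R".
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R\b")


def list_objects(data: bytes) -> list:
    """Return (object number, generation, referenced object numbers) tuples."""
    found = []
    for match in OBJ_RE.finditer(data):
        refs = [int(num) for num in REF_RE.findall(match.group(3))]
        found.append((int(match.group(1)), int(match.group(2)), refs))
    return found


# Hypothetical sample, copied from the two objects quoted above.
SAMPLE = b"""1 0 obj
<<
  /Type /Pages
  /Count 1
  /Kids [ 2 0 R ]
>>
endobj

2 0 obj
<<
  /Type /Page
  /Contents 5 0 R
>>
endobj
"""

for number, generation, refs in list_objects(SAMPLE):
    print(number, generation, "->", refs)
```

On the sample this reports that object 1 points at object 2, and object 2 points at object 5 (the page contents).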

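Rather than editing the PDF in place, an editor can append a new copy of an object to the end of the file, along with a new cross-reference table and trailer that point at it. Readers then use the newest copy. Notice the object number stays the same here; the second number, the "generation", is normally only bumped when an object number is freed and later reused. This allows leaving the original PDF intact while making edits.

    1 0 obj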
    <<
      /Type /Pages
      /Count 2
      /Kids [ 2 0 R 200 0 R ]
    >>
    endobj
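The "later copy wins" idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: a real reader follows the cross-reference tables starting from the last trailer instead of scanning the file, and the sketch keys on the object number only. `latest_objects`, `ORIGINAL` and `APPENDED` are hypothetical names for this example.

```python
import re

# Matches "N G obj ... endobj" and captures the object number and body.
OBJ_RE = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj", re.DOTALL)


def latest_objects(data: bytes) -> dict:
    """Map each object number to the body of its last definition in the file."""
    latest = {}
    for number, body in OBJ_RE.findall(data):
        # Matches come in file order, so later definitions overwrite earlier ones.
        latest[int(number)] = body.strip()
    return latest


# Hypothetical file contents: the original Pages object, then an appended copy.
ORIGINAL = b"1 0 obj\n<< /Type /Pages /Count 1 /Kids [ 2 0 R ] >>\nendobj\n"
APPENDED = b"1 0 obj\n<< /Type /Pages /Count 2 /Kids [ 2 0 R 200 0 R ] >>\nendobj\n"

current = latest_objects(ORIGINAL + APPENDED)
print(current[1])
```

The old bytes are still sitting in the file; they are just no longer the definition a reader ends up with.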

You can have pretty much anything you want inside a PDF, and it can be orphaned so that a PDF reader never picks up on it. Nothing says an object has to be referenced. (Oh, and there's a "trailer" at the end of the PDF that says where the Root node is, so readers know where to start.)
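A rough way to spot orphaned objects is to start at the Root named in the trailer, follow every reference, and see what is never reached. The Python sketch below assumes a decompressed, text-only file and uses a regex scan rather than a real parser; `orphaned_objects` and the sample file are invented for illustration.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R\b")
ROOT_RE = re.compile(rb"/Root\s+(\d+)\s+\d+\s+R\b")


def orphaned_objects(data: bytes) -> set:
    """Return the numbers of objects that cannot be reached from the Root."""
    # Later definitions overwrite earlier ones, so appended copies win.
    objects = {int(num): body for num, body in OBJ_RE.findall(data)}
    roots = ROOT_RE.findall(data)
    if not roots:
        raise ValueError("no /Root entry found")
    stack = [int(roots[-1])]  # assume the last trailer is the current one
    seen = set()
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(int(ref) for ref in REF_RE.findall(objects[num]))
    return set(objects) - seen


# Hypothetical file: object 9 exists but nothing points at it.
SAMPLE = b"""1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Count 1 /Kids [ 3 0 R ] >> endobj
3 0 obj << /Type /Page /Parent 2 0 R >> endobj
9 0 obj << /Secret (never referenced) >> endobj
trailer
<< /Root 1 0 R >>
%%EOF
"""

print(orphaned_objects(SAMPLE))
```

Here only object 9 is reported, since objects 1, 2 and 3 are all reachable from the Root.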

  • Thanks for the technical explanation! This is pretty fascinating.

    So it works kind of like a soft delete — dereference instead of scrubbing the bits.

    Is this behavior generally explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half-baked feature) someone implemented years ago that has just kind of stuck around and propagated?

    • The intention is to make editing easy and quick on slow, memory-constrained computers. This is why, for example, filling in form field values in a PDF can be so fast: it’s just appending new values for those nodes. If you need to get rid of the old edits, you’d have to regenerate a fresh PDF from the root.

  • To put it really simply, a PDF is like a Notion document (blocks and bricks) with a git-like object graph?

    • Ha! As if anything about Notion is simple.

      But yeah. It's all just objects pointing at each other. It's mostly tree-structured, but not entirely. You have a Catalog of Pages that have Resources, like Fonts (which are likely to be shared by multiple pages, hence not a tree). Each Page has Contents that are a stream of drawing instructions.

      This gives you a sense of what it all looks like. The contents of a page are a stack-based vector drawing system. Squint a little (or stick it through an LLM) and you'll see that Tf switches to font F4 from the resources at size 14.66, Td moves the text position, Tj draws a glyph there, etc.

          2 0 obj
          <<
            /Type /Page
            /Resources <<
              /Font <<
                /F4 4 0 R
              >>
            >>
            /Contents 5 0 R
          >>
          endobj
      
          5 0 obj
          <<
            /Length 340
          >>
          stream
          q
          BT
          /F4 14.66 Tf
          1 0 0 -1 0 .47981739 Tm
          0 -13.2773438 Td <002B> Tj
          10.5842743 0 Td <004C> Tj
          ET
          Q...
          endstream
          endobj
      

      I'm going to hand-wave away the 100+ different types of objects. But at its core it's a simple model.
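For the content stream quoted in the reply above, the "operands first, operator last" layout can be shown with a tiny Python tokenizer. This is only a sketch: it handles names, numbers and hex strings, ignores the other token types a real stream can contain, and stops where the quoted stream is elided. `text_ops`, `TOKEN_RE` and `STREAM` are hypothetical names.

```python
import re

# The drawing instructions quoted above, up to the point where they are elided.
STREAM = b"""q
BT
/F4 14.66 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <002B> Tj
10.5842743 0 Td <004C> Tj
ET
Q"""

# A token is either a hex string like <002B> or a run of non-space characters.
TOKEN_RE = re.compile(rb"<[0-9A-Fa-f]*>|[^\s<>]+")
NUMBER_RE = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")


def text_ops(stream: bytes):
    """Yield (operator, operands) for the Tf, Td and Tj operators only."""
    operands = []
    for token in TOKEN_RE.findall(stream):
        text = token.decode("ascii")
        if text[0] in "/<" or NUMBER_RE.fullmatch(text):
            operands.append(text)  # names, hex strings and numbers pile up
        else:
            if text in ("Tf", "Td", "Tj"):
                yield text, operands
            operands = []  # any operator consumes the operands before it


for operator, operands in text_ops(STREAM):
    print(operator, operands)
```

The hex strings are character codes in font F4's own encoding, so turning them into readable letters would also need the font object (4 0 obj), which isn't shown here.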

At the bottom of the page there's a link to the pdfresurrect package, whose description says:

"The PDF format allows for previous changes to be retained in a revised version of the document, thereby keeping a running history of revisions to the document.

This tool extracts all previous revisions while also producing a summary of changes between revisions."

PDFs are just a table of objects and a tree of references to those objects; prior versions of the document were probably kept as objects that nothing references any more, or something like that.
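One way to picture "previous revisions" is that each appended update normally ends with its own `%%EOF` marker, so cutting the file right after an earlier marker gives back an earlier version. The Python sketch below illustrates that idea only; it is an assumption about the general approach, not a description of how pdfresurrect works, and some files (linearized ones, for instance) contain extra `%%EOF` markers that are not separate revisions. `revisions` and `DATA` are made-up names.

```python
def revisions(data: bytes) -> list:
    """Return one file prefix per %%EOF marker, oldest revision first."""
    marker = b"%%EOF"
    prefixes = []
    position = data.find(marker)
    while position != -1:
        end = position + len(marker)
        prefixes.append(data[:end])  # everything up to and including this marker
        position = data.find(marker, end)
    return prefixes


# Hypothetical file: an original body followed by one appended update.
DATA = b"%PDF-1.7\noriginal objects\n%%EOF\nappended update\n%%EOF\n"

for index, prefix in enumerate(revisions(DATA), start=1):
    print("revision", index, "is", len(prefix), "bytes")
```

Writing each prefix to its own file would give something a PDF reader can open as the document stood at that point, assuming each update was appended cleanly.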