← Back to context

Comment by mmh0000

1 day ago

Layers.

PDF is an absurdly complex file format. It's part of the reason there is no single "good" PDF reader, just a lot of mediocre PDF readers that are all terrible in their own way. Which is a topic for another day.

There are several ways to remove data in a PDF:

- Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

- Replace the data. This what what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement. The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

- Then you have the computer illiterate, who think changing the foreground and background color to black is good enough anyway.

This seems highly misleading.

> - Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

Compared to other formats this is actually relatively easy in a PDF since the way the text drawing operators work they don't influence the state for arbitrary other content. A lot of positioning in a PDF is absolute (or relative to an explicitly defined matrix which has hardcoded values). Usually this makes editing a PDF harder (since when changing text the related text does not adapt automatically), but when removing data it makes it much easier since you can mostly just delete it without affecting anything else. (There are exceptions for text immediately after the removed data, but that's limited and relatively easy to control.)

> - Replace the data. This what what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement.

That's actually rather tricky in PDFs since they usually contain embedded subset fonts and these usually do not have "🮋" as part of the subset. Also doing this would break the layout since "🮋" has a different width than most letters in a typical font, so it would not lead to less formatting issues than the previous option. Unless the "🮋" is stretched for each letter to have the same dimensions, but then the stretched characters allow to recover the text.

> The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

PDF does not have a concept of a background color. If it looks like a background color in PDF, you have a rectangle drawn in one color and something in the foreground color in front of it. What you usually see in badly redacted PDF files is exactly this, but in opposite color: Someone just draws a black box on top of the characters. You could argue that this is smarter since it would still work even if someone would chnage colors, but of course, PDF is a vector format. If you just add a rectangle, someone else can remove it again. (And also copy & paste doesn't care about your rectangle)

>- Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

>- Replace the data. This what what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement. The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

You're making it sound way harder than it is, when both adobe acrobat and the built-in preview app on mac can both competently redact documents. I'm not aware of instances of either (or any other purpose-made redaction tools) failing. I wouldn't homebrew a python script to do my redaction either, but that doesn't mean doing redactions properly in some insurmountable task for some intern.

  • I would not trust either tool to adequately redact documents, though I'm sure it works under normal levels of scrutiny.

    The most reliable way is to just screenshot the document or print and scan it, effectively burning it down and recreating it in a new format that has no concept of the past. This works across basically all formats, too, and against all tools.

> Then you have the computer illiterate, who think changing the foreground and background color to black is good enough anyway

To be fair, this works if you print out those copies and then re-scan them.

Thanks for this. Really quells the urge I get every so often to just code my own PDF editor, because they all suck and certainly it couldn't be THAT hard. Such hubris!

  • Heh, have at it, here's the full spec: https://developer.adobe.com/document-services/docs/assets/5b...

    Should take... a weekend tops? ;) PDF is crazy and scary

    • > PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object

      Wait, this is more complete than SOAP. It may be a good idea to redo the IPC protocol with a different serialization format!

      1 reply →

    • 7.5.6 "Incremental updates" from the specification is an interesting section too, speaking about accessing data people didn't think to remove from PDF files properly.

  • Don't stop yourself before getting started. I believe in you - maybe you could write the one editor that would actually work!

    Not kidding - it's a ~~~billion dollar market haha

    Make an MVP/Show HN :-)

  • I did a bunch of work creating pdfs using a low-level API, object goes here stuff.

    As far as I understand it, at its core, pdf is just a stream of instructions that is continually modifying the document. You can insert a thousand objects before you start the next word in a paragraph. And this is just the most basic stuff. Anything on a page can be anywhere in the stream. I don't know if you can go back and edit previous pages, you might have a shot at least trying to understand one page at a time.

    Did you know you can have embedded XML in PDFs? You can have a paper form with all the data filled in and include an XML version of that for any computer systems that would like an easier way to read it.

  • Bravo to you for recognising the load-bearing 'just' before you threw it around :)

qpdf has a redaction option. It’s routinely used to anonymize medical records for studies.