Can someone spell out how this is possible? Do pdfs store a complete document version history? Do they store diffs in the metadata? Does this happen each time the document is edited?
You can replace objects in PDF documents. A PDF is mostly just a bunch of objects of different types so the readers know what to do with them. Each object has a numbered ID. I recommend mutool for decompressing the PDF so you can read it in a text editor:
mutool clean -d in.pdf out.pdf
If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).
Rather than editing the PDFs in place, it's possible to update these objects to overwrite them by appending a new "generation" of an object. Notice the 0 has been incremented to a 1 here. This allows leaving the original PDF intact while making edits.
1 1 obj
<<
/Type /Pages
/Count 2
/Kids [ 2 0 R 200 0 R ]
>>
endobj
You can have anything inside a PDF that you want really and it could be orphaned so a PDF reader never picks up on it. There's nothing to say an object needs to be referenced (oh, there's a "trailer" at the end of the PDF that says where the Root node is, so they know where to start).
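As a rough illustration of the point above, here is a minimal sketch that scans a decompressed PDF for object headers and reports object numbers that are defined more than once. The function name `find_redefined_objects` is made up for this example. It assumes the file has already been decompressed (for instance with the `mutool clean -d` command mentioned above), and it groups by object number alone, since a writer may or may not bump the generation number when it appends a replacement. A plain regex scan can also be fooled by binary stream data, so treat the output as a hint rather than a parse.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches an object header such as "1 0 obj" or "1 1 obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_redefined_objects(data: bytes) -> Dict[int, List[Tuple[int, int]]]:
    """Map object number -> [(generation, byte offset), ...] for every
    object number that has more than one definition in the file."""
    seen = defaultdict(list)
    for m in OBJ_HEADER.finditer(data):
        number = int(m.group(1))
        generation = int(m.group(2))
        seen[number].append((generation, m.start()))
    # Keep only object numbers that appear more than once.
    return {num: defs for num, defs in seen.items() if len(defs) > 1}
```

Calling it on the bytes of a file (`find_redefined_objects(open("out.pdf", "rb").read())`) returns a dict; any key in it is an object that was written at least twice, and the earlier offsets point at the superseded versions.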
Thanks for the technical explanation! This is pretty fascinating.
So it works kind of like a soft delete — dereference instead of scrubbing the bits.
Is this behavior generally explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half baked feature) someone implemented years ago that has just kind of stuck around and propagated?
These PDFs apparently used the “incremental update” feature of PDF, where edits to the document are merely appended to the original file.
It’s easy to extract the earlier versions, for example with a plain text editor. Just search for lines starting with “%%EOF”, and truncate the file after that line. Voila, the resulting file is the respective earlier PDF version.
(One exception is the first %%EOF in a so-called linearized PDF, which marks a pseudo-revision that is only there for technical reasons and isn’t a valid PDF file by itself.)
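The truncation trick described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_revisions`, `is_linearized` and `extract_revisions` are invented here, the linearization check is a crude heuristic (it just looks for `/Linearized` near the start of the file), and nothing is validated as a well-formed PDF.

```python
import re
from typing import List

# "%%EOF" at the start of a line, plus an optional trailing line ending.
# The lookbehind accepts start-of-file, "\n" or "\r" before the marker.
EOF_MARKER = re.compile(rb"(?<![^\r\n])%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(data: bytes) -> List[bytes]:
    """Return one candidate file per %%EOF marker: the bytes from the
    start of the file up to and including that marker's line."""
    return [data[: m.end()] for m in EOF_MARKER.finditer(data)]


def is_linearized(data: bytes) -> bool:
    """Heuristic: linearized files carry a /Linearized key in the first
    object, which sits near the start of the file."""
    return b"/Linearized" in data[:1024]


def extract_revisions(data: bytes) -> List[bytes]:
    """Like split_revisions, but drops the first pseudo-revision of a
    linearized file, which is not a valid PDF on its own."""
    revisions = split_revisions(data)
    if revisions and is_linearized(data):
        revisions = revisions[1:]
    return revisions
```

Each returned element can be written out to its own file and opened in a viewer; the last element is normally the complete current document.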
New OSINT skill unlocked
I see an interesting parallel to how people think about captured encrypted data, and how long that encryption needs to be effective for until technology catches up and can decrypt (by which point, hopefully the decrypted data is worthless). If all of these documents are stored in durable archives, future methodologies may arrive to extract value or intelligence not originally available at the time of capture and disclosure.
It's hilarious the extent to which Adobe Systems's ridiculously futile attempt to chase MS Word features ended up being the single most productive espionage tool of the last quarter century.
I don’t think this was particularly modeled on MS Word. The incremental update feature was introduced with PDF 1.2 in 1996. It allows changes to be saved quickly without having to rewrite the whole file, for example when annotating a PDF.
Incremental updates are also essential for PDF signatures, since when you add a subsequent signature to a PDF, you couldn’t rewrite the file without breaking previous signatures. Hence signatures are appended as incremental updates.
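To make the signature point concrete: a PDF signature dictionary carries a `/ByteRange` array of (offset, length) pairs naming the bytes the signature covers. If the last covered byte is not the last byte of the file, something was appended after that signature was made. The sketch below only reads those numbers; the function name `signature_coverage` is invented for this example, it does not verify any cryptography, and it assumes the signature dictionary is stored uncompressed.

```python
import re
from typing import List, Tuple

# /ByteRange [ off1 len1 off2 len2 ]: the two signed byte spans.
BYTE_RANGE = re.compile(
    rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]"
)


def signature_coverage(data: bytes) -> List[Tuple[int, bool]]:
    """For each /ByteRange found, return (end of signed data,
    whether that end coincides with the end of the file)."""
    results = []
    for m in BYTE_RANGE.finditer(data):
        _off1, _len1, off2, len2 = (int(x) for x in m.groups())
        signed_end = off2 + len2
        results.append((signed_end, signed_end == len(data)))
    return results
```

In a file with several signatures, only the most recent one would be expected to reach the end of the file; earlier ones stop where their revision ended.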
The "print and scan physical papers back to a PDF of images" technique for final release is looking better and better from an information protection perspective.
> The "print and scan physical papers back to a PDF of images" technique for final release is looking better and better from an information protection perspective.
Note that all (edit: color-/ink-) printers have "invisible to the human eye" yellow dotcodes, which contain their serial number, and in some cases even the public IP address when they've already connected to the internet (looking at you, HP and Canon).
So I'd be careful to use a printer of any kind if you're not in control of the printer's firmware.
There are lots of tools that decode the information hidden in dotcodes, in case you're interested [1] [2] [3]
[1] https://github.com/Natounet/YellowDotDecode
[2] https://github.com/mcandre/dotsecrets
[3] (when I first found out about it in 2007) https://fahrplan.events.ccc.de/camp/2007/Fahrplan/events/197...
That's why I'm (still) waiting on this https://www.crowdsupply.com/open-tools/open-printer
It's mindboggling how much open-source 3d printing stuff is out there (and I'm grateful for it) but this is completely lacking in the 2d printing world
Thanks for the links but can you share evidence for the "public IP address" claim? Each time I've read this concept (intriguing! possible!), I search for evidence and I can't find any.
The MIC and yellow dots have been studied and decoded by many and all I've ever seen, including at your links, are essentially date + time + serial#.
Don't get me wrong ... stamping our documents with a fingerprint back to our printers and adding date and time is nasty enough. I don't see a need to overstate the scope of what is shared though.
>Note that all printers have "invisible to the human eye" yellow dotcodes, which contain their serial number, and in some cases even the public IP address when they've already connected to the internet (looking at you, HP and Canon).
I've got a black-and-white Brother printer which uses toner. Is there something similar for this printer?
If you have a UV flashlight, these dots are visible with decent vision.
And of course we have to include the Wikipedia entry:
https://en.wikipedia.org/wiki/Printer_tracking_dots
Could this be circumvented by randomly (or not-so-randomly) adding single-pixel yellow dots to the data sent to the printer?
A better approach is to convert them to jpeg/png. Then convert that to raw BMP, and then share or print that.
A more modern approach for text documents would be to have an LLM read and rephrase, and restructure everything without preserving punctuation and spacing, using a simple encoding like utf-8, and then use the technique above or just take analog pictures of the monitor. The analog (film) part protects against deepfakes and serves as proof if you need it (for the source and final product alike).
There are various solutions out there, following the leaks that keep happening, where documents and confidential information are served/staged in a way that will reveal the person with whom they are shared. Even if you copy-paste the text into Notepad and save it in ASCII format, it will reveal you. Off-the-shelf printers are of course a big no-no.
If all else fails, that analog picture technique works best for exfil, but the final thing you share will still track back to you. I bet spies are back to using microfilms these days.
I only say all of that purely out of a fascination with the subject and for the sake of discussion (think like a thief if you want to catch one, and all). Ultimately, you shouldn't share private information with unauthorized parties, period. Personal or otherwise. If you, like Snowden, feel that all lawful means are exhausted and that this is your only option to address some grievance, then don't assume any technique or planning will protect you. If it isn't worth the risk of imprisonment, then you shouldn't be doing it anyway. Assume you will be imprisoned or worse.
I suppose I'd just save the pdf to tiff/png then remake back into a pdf from there to avoid printing and scanning?
If really paranoid, I suppose one could run a filter on the image files to make them a bit fuzzy/noisy.
I think "Print to PDF" would be easiest
Why not just take a screenshot of every PDF page?
It could still be identifiable, for example if the document has been prepared such that the intended recipient's identity is encoded into subtle modulation of the widths of spaces.
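A toy version of that idea, to show why a screenshot would not help: the sketch below hides a bit string in a text by swapping some ordinary spaces for thin spaces (U+2009), which render slightly narrower. Everything here is hypothetical (the function names, the choice of U+2009, the bit format); a real system would more likely adjust spacing in the rendered layout, and the sketch assumes the input text contains no thin spaces of its own.

```python
NORMAL = " "        # ordinary space, encodes a 0 bit
MARKED = "\u2009"   # thin space, encodes a 1 bit


def embed_bits(text: str, bits: str) -> str:
    """Walk the text and use each space to carry one bit of `bits`.
    Spaces beyond the end of `bits` are left as ordinary spaces."""
    remaining = iter(bits)
    out = []
    for ch in text:
        if ch == NORMAL:
            bit = next(remaining, "0")
            out.append(MARKED if bit == "1" else NORMAL)
        else:
            out.append(ch)
    return "".join(out)


def extract_bits(text: str, n: int) -> str:
    """Read the first `n` bits back from the spaces in `text`."""
    bits = ["1" if ch == MARKED else "0"
            for ch in text if ch in (NORMAL, MARKED)]
    return "".join(bits[:n])
```

Because the mark lives in the visible layout rather than in file metadata, it survives printing, scanning and screenshots as long as the spacing differences remain measurable.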
That'd be fun to make Section 508 compliant at mass scale.
Is there a multifunction B&W printer which prints and then automatically positions the paper on the scanner and scans?
Far more straightforward to print a stack, then feed that stack through the copier/scanner.
There needs to be better tooling for inspecting PDF documents. Right now, my needs are met by using `qpdf` to export QDF [1], but it is just begging for a GUI to wrap around it...
[1] https://qpdf.readthedocs.io/en/stable/qdf.html
Take a look at the REMNux reverse engineering page for PDF documents (https://docs.remnux.org/discover-the-tools/analyze+documents...). Lots of tools here for looking at malicious PDFs that can be used to inspect/understand even non-malicious documents.
In what context do you use that tool? It looks like that page is primarily about editing PDFs using that format rather than inspecting them.
Very tempting to fool around with the ideas especially after the Epstein pdf debacle.
This is insightful work, great job.
Recently someone else revisited the Snowden documents and also found more info, but I can't recall the exact details.
Snowden and the archives were absolute gifts to us all. It's a shame he didn't release everything in full though.
Thank you. The most recent completely new information from the Snowden files is found in Jacob Appelbaum's 2022 thesis[1], in which he revealed information that had not been previously public (not found on any previously published documents and so on). And AFAIK, the most recent new information from the published documents (along with this post) might actually be in our other posts[2], but there might be some others we aren't aware of.
[1]: https://www.electrospaces.net/2023/09/some-new-snippets-from...
[2]: Part 2: https://libroot.org/posts/going-through-snowden-documents-pa...
and part 3: https://libroot.org/posts/going-through-snowden-documents-pa...
Snowden never had Russia as a destination, the US revoked his passport while he was waiting in a layover. He was stuck in the airport for months. How is it "telling" of anything?
The best way to fix a problem is to bring it into the light, not pretend it doesn't exist. "Security by obscurity" has been debunked for decades.
If our system is so flawed Snowden's leaks would have blown everything up, maybe the system deserves to be blown up.
Otherwise we're just papering over flaws which likely will be discovered and exploited eventually.
Your comment is indeed very telling. He ended up in Russia because the U.S. revoked his passport while he was en route to Ecuador, so he was forced to live in a Russian airport for 6 weeks.
>It is of course very telling that Snowden ended up in Russia.
Yeah it's almost like you can revoke someone's passport during their layover in Russia and make the people with MAGA-levels of intelligence take the optics at face value through decade-long repeated messaging.
If Snowden were a Russian spy, he would've taken the files, given them to Putin, received the largest dacha in the country, and we would never have heard from him or the files. Instead, he gave them to journalists, who made the call on what to release.
If you don't want people to blow the whistle, stop breaking the damn law https://www.theguardian.com/us-news/2020/sep/03/edward-snowd...
Spy on me harder, daddy
Wow the Reddit bots made it to HN. We must be famous now.
> We contacted Ryan Gallagher, the journalist who led both investigations, to ask about the editorial decision to remove these sections. After more than a week, we have not received a response.
Hopefully we'll hear something now that the Christmas holidays are over.
Why are the journalists redacting the docs? That's incredibly puzzling.
Is there something in here so damaging that they refuse to publish it?
Did the government tell them they'd be in trouble if they published it?
Are the journalists the only ones with access to the raw files?
Traditionally an editor would be obligated to review the material and redact info that could be harmful to others. The publisher has distinct liability independent of govt opinion.
To put it really simply, a PDF is like a Notion document (blocks and bricks) with a git-like object graph?
At the bottom of the page there's a link to the pdfresurrect package, whose description says
"The PDF format allows for previous changes to be retained in a revised version of the document, thereby keeping a running history of revisions to the document.
This tool extracts all previous revisions while also producing a summary of changes between revisions."
Neat!
https://github.com/enferex/pdfresurrect
https://hackerfactor.com/blog/index.php?/archives/1085-A-Typ...
PDFs are just a table of objects and tree of references to those objects; probably, prior versions of the document were expressed in objects with no references or something like that.
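One way to test that guess on a decompressed file is to list objects that are defined but never referenced. The following is a rough sketch, not a real PDF parser: `find_orphans` is a made-up name, the regexes can be fooled by binary stream contents, and objects stored in compressed object streams will be invisible to it. Note also that when an incremental update replaces an object under the same number, the old copy will not show up here, because the number is still referenced; this only catches objects that nothing points to at all.

```python
import re
from typing import List, Tuple

OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")   # e.g. "4 0 obj"
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")     # e.g. "4 0 R"


def find_orphans(data: bytes) -> List[Tuple[int, int]]:
    """Return (object number, generation) pairs that are defined
    somewhere in the file but never appear as an indirect reference."""
    defined = {(int(m.group(1)), int(m.group(2)))
               for m in OBJ_DEF.finditer(data)}
    referenced = {(int(m.group(1)), int(m.group(2)))
                  for m in OBJ_REF.finditer(data)}
    return sorted(defined - referenced)
```

The trailer's `/Root` entry counts as a reference, so the document catalog is not reported as an orphan.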
In addition to the print paper and scan approach, I do wonder how effective it would be to “Print to XPS” and then “print” that into a PDF.
It's crazy this is just being discovered now.
I think it's likely someone already discovered this. It's just that the info isn't broadcast to the people who want to comment on this thread.
I wonder if it’s because of all the attention on the Epstein PDF files.
% pdfresurrect -w epsteinfiles.pdf
Anyone tried this?
Weekend project?
So this is almost certainly redaction by the journalists?
It is disappointing they didn't mark those sections "redacted", with an explanation of why.
It is also disappointing they didn't have enough technical knowhow to at least take a screenshot and publish that rather than the original PDF which presumably still contains all kinds of info in the metadata.
Yes, the journalists did the redactions. The metadata timestamps in one of the documents show that the versions were created three weeks before the publication.
And to be honest, the journalists have generally done great work on pretty much all the other published PDFs. We've gone through hundreds and hundreds of the published documents, and these two documents were pretty much the only ones with a metadata leak that revealed something significant by mistake (there are other documents with metadata leaks/failed redactions as well, but nothing huge). Our next part will be a technical deep-dive on the PDF forensic/metadata analysis we've done.
Great work, great comment.
Thank you.
I have read claims that fake documents were inserted in those leaks with the aim of pushing disinformation.
That itself would be a very convenient lie if the disclosures were damaging or embarrassing.
Maybe you should include a source, especially if you're making claims about alleged "disinformation"? :-)
How much of this research and review is hands-on and how much of it is—ahem—machine assisted?
Are you asking how much was done with pen and paper, and how much of it was done on a computer, i.e. machine assisted? Where do you draw the line? How is "hands-on" in contrast to anything? Is it only "hands-on" when you don't use any tool to assist you?
I suspect you're inquiring about the use of LLMs, and about that I wonder: Why does it matter? Why are you asking?
First thanks for taking my question seriously and not as just a rib and asking a lot of questions in return that I want to consider myself.
By "hands-on" I'm asking whether the provided insight is the product of human intellection. Experienced, capable and qualified. Or at least an earnest attempt at thinking about something and explaining the discoveries in the ways that thinking was done before ChatGPT. For some reason I find myself using phrases involving the hands (e.g. hands-on, handmade, hand-spun) as a metaphor for work done without the use of LLMs.
I emphasize insight because I feel like the series of work on the Snowden documents by libroot is wanting in that. I expressed as much the last time their writing hit the front page: <https://news.ycombinator.com/item?id=46566372>. I don't think that that's an implausible claim but I find issue with it being made with such confidence by the anonymous source behind the investigations (I'm withholding ironically putting "investigations" in...nevermind).
If the author actually provided something that conveyed to the reader why this information is significant, what to do with or think about it, how they came about discovering the answers to the aforementioned 'why' and ‘what’, and additionally why their word ought to matter to us at all, I'd be less inclined to speculate that this is just someone vibe sleuthing their way through documents that on the surface are only significant to the public as the claim "the government is spying on you" is.
This particular post uncovers some nice information. It's a great find. I'm in no position to investigate whether it was already known. But what are we supposed to learn from it aside from "one of the documents was changed before it was made public"? What's significant about the redaction? Is Ryan Gallagher responsible? Or does he know who is? Is he at all obliged to explain this to a presumably anonymous inquirer? Or is it now the duty of the public to expect an explanation as affected by said anonymous inquirer?
Remember when believing that the government was rife with pedophiles automatically associated you with horn-helmet-wearing insurrectionists?