A case study in PDF forensics: The Epstein PDFs

4 days ago (pdfa.org)

I found this part interesting:

There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.
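The skew measurement mentioned in the quote can be sketched roughly as follows. This is a minimal illustration, not the article's method: it assumes the two top corners of each page border have already been located (the coordinates here are made up), computes the angle of the top edge, and checks whether every page shares the same skew.

```python
import math

def skew_degrees(top_left, top_right):
    """Angle of the border's top edge relative to horizontal, in degrees.

    Points are (x, y) pixel coordinates with y growing downward, so a
    positive result means the right end of the edge sits lower.
    """
    dx = top_right[0] - top_left[0]
    dy = top_right[1] - top_left[1]
    return math.degrees(math.atan2(dy, dx))

def identical_skew(angles, tolerance=0.01):
    """True if every page's skew is within `tolerance` degrees of the first."""
    return all(abs(a - angles[0]) <= tolerance for a in angles)

# Hypothetical border corners for three pages (made-up numbers).
pages = [
    ((100, 120), (1100, 127)),
    ((100, 120), (1100, 127)),
    ((98, 118), (1098, 125)),
]
angles = [skew_degrees(tl, tr) for tl, tr in pages]
print([round(a, 3) for a in angles])
print("identical skew:", identical_skew(angles))
```

Real scans fed through a sheet feeder would be expected to show a slightly different angle on each page, which is why identical values across pages look suspicious.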

  • GNOME Desktop users can put this in a Bash script in ~/.local/share/nautilus/scripts/ for more convincing-looking fake PDF scans, accessible from the right-click menu. I do not recall where I copied it from originally, so I can't give credit; thanks, random internet person (probably on Stack Exchange). It works perfectly.

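      for pdf in "$@"; do
        # Pick a random skew between 0.05 and 0.5 degrees, with a random sign.
        ROTATION=$(shuf -n 1 -e '-' '')$(shuf -n 1 -e $(seq 0.05 0.05 0.5))

        # Rasterize at 150 DPI, then stretch contrast, rotate, add noise, and convert to grayscale.
        magick -density 150 "$pdf" \
               -linear-stretch '1.5%x2%' \
               -rotate "$ROTATION" \
               -attenuate '0.01' \
               +noise Multiplicative \
               -colorspace 'gray' \
               "${pdf%.*}-fakescan.${pdf##*.}"
      done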

  • The real question is: Which of the documents are the ones that are "simulating" scanned documents, and what political narrative do they reinforce?

    The only reason I can think of for why someone would want to do this is to pass off fraudulent or AI generated images as real.

    • A simpler explanation could be wanting to skip the print->sign->scan ceremony required by some institutions.

    • This. Slip in a few thousand “fakes” with the trove of goods to be able to fabricate a narrative.

    • Another explanation is that it's simply one form of lazy ineffective obfuscation performed by inexperienced relative luddites in an attempt to walk the fine line between complying with the supreme court directive & not releasing anything useful.

      Other investigations into the files have found oddities like redaction of the word "don't" indicating a haphazard find-&-replace approach to redaction, possibly LLM-aided.

      The DOJ/Akamai online hosted search feature is also incomplete - potentially due to some of these "digitally scanned" files not being subject to OCR.

    • > to pass off fraudulent or AI generated images as real.

      Possibly but I don't find it compelling, if only because a significant portion of the media reportage on the files has made claims that are entirely baseless - if there were a narrative to be sold one would expect such reportage to be actively leveraging such fraudulent images.

  • Very interesting. That document in particular seems to be an interview of A. Acosta by the DoJ from 2019. But what reason would the FBI have for pretending it's a scanned document, if it is genuine? Perhaps there's some aspect of Epstein's deal with Acosta that they'd rather not reveal to the public?

    https://www.justice.gov/epstein/files/DataSet%207/EFTA000092...

    • Not that I can speak from personal experience or anything... But somebody on an email chain may have requested a scanned version of the document to ensure there is no metadata and the employee might have found it easier to just flatten the pdf and apply a graphical filter to make the document appear like a scanned document. There might even be a webtool available somewhere to do so, I wouldn't know...

    • I am only guessing that they had to remove the document from a classified network in a way where data won't possibly leak

  • I mean, I do that all the time when they ask me to print something, sign it, and then scan it.

    Sign a blank paper, scan it, paste the original doc on it. Then keep the scan for future docs.

    • An easier trick I've used is to just sign directly on the computer screen over the displayed document with a whiteboard marker and take a photo with my phone.

Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.

  • There was a post on here about a stylometry project that analyzed HN users' comment histories. The tool helped find accounts with a writing style extremely similar to that of a given account. The site was soon removed due to privacy concerns, but many users with multiple accounts attested to its accuracy.

    https://news.ycombinator.com/item?id=33755016

    It turns out stylometry is actually a pretty well-developed field. It makes me wanna write an AI browser assistant that can take my comments and stylize them randomly to make it harder to use these sorts of forensics against me

    • >It makes me wanna write an AI browser assistant that can take my comments and stylize them randomly to make it harder to use these sorts of forensics against me

      The old trick years ago was to translate from English to a different language and back (possibly repeating). I'd be curious how helpful it is against stylometric detection.

    • On the one hand, it's a shame this tool was removed because it's very interesting; on the other hand, the main use case would likely be abuse and (cyber)stalking.

      That said, best to assume that the various government agencies have tools like this, and better ones. If you're trying to hide your identity online, don't just change usernames or go through VPNs/proxies/Tor, but change your writing style too.

      (Also, I'm convinced most VPNs / proxies / Tor nodes / public access points are honeypots.)

    • A while back the government claimed it had used stylometry to identify Satoshi Nakamoto.

    • I remember using one of these tools and it falsely identified some other account as being mine. Of course, I only have just this account.

  • Stylometry is extremely powerful even with simple n-gram analysis. There's a demo that can easily pick out who you are on HN from just a few paragraphs of your own writing.

    https://news.ycombinator.com/item?id=33755016

    You can also unironically spot most types of AI writing this way. The approaches based on training another transformer to spot "AI generated" content are wrong.

    • > You can also unironically spot most types of AI writing this way.

      I have no idea if specialized tools can reliably detect AI writing but, as someone whose writing on forums like HN has been accused a couple of times of being AI, I can say that humans aren't very good at it. So far, my limited experience with being falsely accused is it seems to partly just be a bias against being a decent writer with a good vocabulary who sometimes writes longer posts.

      As for the reliability of specialized tools in detecting AI writing, I'm skeptical at a conceptual level because an LLM can be reinforcement trained with feedback from such a tool (RLTF instead of RLHF). While they may be somewhat reliable at the moment, it seems unlikely they'll stay that way.

      Unfortunately, since there are already companies marketing 'AI detectors' to academic institutions, they won't stop marketing them as their reliability continues to get worse. Which will probably result in an increasing shit show of false accusations against students.

    • Hacker News is one of the best places for this, because people write relatively long posts and generally try to have novel ideas. On 4chan, most posts are very short memey quips, so everybody's style is closer to each other's than it is to their own normal writing style.

    • Funnily this also implies that laundering your writing through an AI is a good way to defeat stylometry. You add in a strong enough signal, and hopefully smooth out the rest.

  • People have always claimed this as a data leak vector, but I've always been sceptical. Writing style and vocabulary are probably shared among too many people to narrow it down much. (How many people that you know could have written this reply?) The counterargument is that he had a very specific style in his mail, so maybe this is a special case.

    • this is a well-studied field (stylometry). when combining writing styles, vocabulary, posting times, etc. you absolutely can narrow it down to specific people.

      even when people deliberately try to feign some aspects (e.g. switching writing styles for different pseudonyms), they will almost always slip up and revert to their most comfortable style over time. which is great, because if they aren't also regularly changing pseudonyms (which are also subject to limited stylometry, so pseudonym creation should be somewhat randomized in name, location, etc.), you only need to catch them slipping once to get the whole history of that pseudonym (and potentially others, once that one is confirmed).

  • The writing style is rather interesting. Epstein seems borderline dyslexic, but almost none of the emails I've seen are written in a coherent way, regardless of the sender.

    Either people on that level rarely write anything on their own and have completely forgotten how to construct proper sentences, or maybe that's just how they communicate: a sort of language internal to the group.

  • > I don't buy the MaxwellHill claims for various reasons

    Why not? Clear motive, matching timeline, mentions of that reddit account in the released FBI documents of her case

    • I was there for the original thread making the connection so I got a very fresh look at the profile. The user was consistently referring to being in dental school. A lot of posting, and not in ways that would influence opinions. Maybe a cover for more secretive mod actions, but it'd be a wastefully excessive cover.

      Other mods knew them personally and were still in contact. The user claims they heard of the rumor and decided not to reactivate for the lulz.

      I am not familiar with the mod side of reddit - couldn't fellow mods audit her mod action logs to find more juicy details we would have heard about by now?

      If Maxwell is indeed a spy and doing what she is claimed to do, it is highly unlikely that she'd put her last name and a reference to her specific family's property in her username. This would be a glaringly arrogant choice for someone who had been groomed from an early age for spycraft, and who had any degree of oversight.

      If she were part of a spy network, they would be highly remiss not to commandeer the account at the time of her arrest to avoid suspicion unless they were completely incompetent.

      I am mostly familiar with cold war espionage so it just doesn't sound like the general MO to me. Unless Opsec or whatever has badly decayed since then. That's not impossible.

      The mentions of the account in the files are from anonymous tips, some of which are highly absurd. They vetted a lot of tips, and I saw no information in the new releases indicating they thought it held water. We've seen the subpoena and IP tracking for the Epstein prison guard whistleblower, but no such thing on this topic.

  • I'm pretty sure Epstein tried to meet with moot at least once: https://www.jmail.world/search?q=chris+poole
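The n-gram stylometry discussed in this thread can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the removed HN tool: it uses character trigram counts and cosine similarity, and the author names and sample texts are invented. Real systems use far more text and additional features such as posting times.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(unknown_text, known_texts, n=3):
    """Return (author, score) pairs sorted from most to least similar."""
    unknown = char_ngrams(unknown_text, n)
    scores = {
        author: cosine_similarity(unknown, char_ngrams(text, n))
        for author, text in known_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented sample data, for illustration only.
known = {
    "alice": "honestly i reckon the whole thing is overblown, honestly",
    "bob": "Per my previous analysis, the aforementioned methodology is sound.",
}
ranking = rank_candidates("honestly i reckon this is overblown too", known)
print(ranking[0][0])
```

With samples this short the scores are mostly driven by shared phrases, which is consistent with the point made above that short, memey posts are much harder to attribute than long comments.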

Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the output with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
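A comparison like this could be scripted along the following lines. This is a hedged sketch, not the commenter's actual pipeline: it assumes both OCR outputs are already available as plain strings per page, and the page ids, sample texts, and the 0.9 threshold are all made up for illustration.

```python
import difflib

def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())

def ocr_agreement(text_a, text_b):
    """Similarity ratio in [0, 1] between two OCR outputs for the same page."""
    return difflib.SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()

def flag_mismatches(pages, threshold=0.9):
    """Return page ids whose two OCR texts agree less than `threshold`."""
    return [
        page_id
        for page_id, (provided, rerun) in pages.items()
        if ocr_agreement(provided, rerun) < threshold
    ]

# Hypothetical page ids and texts, for illustration only.
pages = {
    "page-0001": ("Meeting moved to  Tuesday.", "Meeting moved to Tuesday."),
    "page-0002": ("M33t1ng m0v3d t0 Tu3sd@y", "Meeting moved to Tuesday."),
}
print(flag_mismatches(pages))
```

`difflib.SequenceMatcher` gets slow on long texts, so for ~500K pages a cheaper first pass (for example comparing word sets) might be needed before running it on the suspicious pages.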

  • Did you take any steps to reduce the image dimensions, in case that improves performance? I have not tried this, as I have not performed an OCR task like this with an LLM. I would be interested to know at what size the VLM can no longer reliably make out the details in the text.

    • The performance is OK; it takes a couple of seconds at most on my GPU. It's just the number of documents to get through that takes time, even with parallelism. The dimensions seem fine as they are, as far as I can tell.

  • [flagged]

    • Haven't seen anything particular about that, but lots of the documents with names that were half-redacted contain OCRd text that is completely garbled, but olmocr-2-7b seems to handle it just fine. Unsure if they just had sucky processes or if there is something else going on.

> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

hopefully someone is independently archiving all documents

my understanding is that some are being removed

The DOJ is technically breaking the law by releasing a heavily modified "reproduction" of the original files, not the "actual" files. The software they used, "OmniPage CSDK 21.1", removes all useful metadata and any encrypted files, if any were stored.

Any guesses why some of the newest files seem to have random "=" characters in the text? My first thought was OCR, but it doesn't seem to be linked to characters like "E" that an OCR tool could mistake for "=". My second guess is that it's meant to make reliable text searches more difficult, but probably 90% of HN readers could find a way to make a search tool that doesn't fall apart when it hits a "=" character (although making this work for long search queries would make the search slower).
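One way such a search tool might look, as a minimal sketch: build a regular expression from the query that tolerates any number of stray "=" characters between letters. The function name and the sample text are made up for illustration.

```python
import re

def tolerant_pattern(query, noise="="):
    """Build a regex matching `query` even if `noise` is interleaved between characters."""
    gap = "(?:" + re.escape(noise) + ")*"
    return re.compile(gap.join(re.escape(ch) for ch in query), re.IGNORECASE)

# Invented sample text with stray "=" characters.
text = "Please wire the pay=ment before Fri=day"
match = tolerant_pattern("payment").search(text)
print(match.group(0) if match else None)
```

The pattern grows with the query length, which is the slowdown for long queries mentioned above. Stripping "=" from a copy of the text before indexing would be faster, but it would also remove legitimate "=" characters.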

What would be more interesting: His Bank accounts.

Who paid him?

Who did get paid?

  • Follow-the-money root cause analysis never reaches the public, although the analysis will influence the real power shift. The general public will receive just enough information so that one group of people can hate another group of people.

  • And for sure the DOJ knows this, or can know it if they want.

    • You think Donald Trump's personal lawyers, Pam Bondi and Todd Blanche, will follow the money without bias? Or FBI director Kash Patel, author of the children's book The Plot Against the King? Or that Director of National Intelligence Tulsi Gabbard, a Russian asset herself, would want to do anything against their power source?

  • Apparently he paid Peter Mandelson for UK government information of major financial significance, which is resulting in him being disgraced for, what, the third or fourth time? This time he's even been reported to the police.

Interesting, there are a handful of PDFs in the drop that appear to be an email with a Base64 encoded attachment—inline.

OCR is so bad of course that decoding the Base64 seems futile without a lot of effort.

Example: https://www.justice.gov/epstein/files/DataSet%2011/EFTA02609...

(More mentioned here: https://old.reddit.com/r/Epstein/comments/1qu9az2/theres_unr...)

  • Would a few byte errors break a binary so much as to make it undecodable?

    • I think it's more than a few byte errors. (I believe this because I spent about 15 minutes on the linked document and came up empty.)
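On the question of how much damage OCR errors do to Base64: each Base64 character carries 6 bits, so a misread character corrupts at most two decoded bytes, while a dropped or inserted character shifts every 4-character group after it. The sketch below demonstrates both cases on an arbitrary 48-byte payload (an assumption for illustration, not data from the files).

```python
import base64

payload = bytes(range(48))                            # arbitrary 48-byte payload
encoded = base64.b64encode(payload).decode("ascii")   # 64 Base64 characters

def differing_bytes(a, b):
    """Count positions where two byte strings differ, plus any length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

pos = 20  # position of the simulated OCR error

# Case 1: one character is misread (substitution). Alignment is preserved.
wrong = "A" if encoded[pos] != "A" else "B"
substituted = encoded[:pos] + wrong + encoded[pos + 1:]
decoded_sub = base64.b64decode(substituted)

# Case 2: one character is dropped. Every later group is misaligned.
dropped = encoded[:pos] + encoded[pos + 1:]
dropped += "A" * (-len(dropped) % 4)  # re-pad so the decoder accepts the length
decoded_drop = base64.b64decode(dropped)

print("substitution damages", differing_bytes(payload, decoded_sub), "bytes")
print("deletion damages", differing_bytes(payload, decoded_drop), "bytes")
```

So a handful of substitutions would leave most of a file intact, but a single missed or doubled character garbles everything after it. If the attachment is itself a compressed format, even small byte errors may be enough to stop it from opening.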

I can't even download the archive; the transmission always terminates just before it's finished. Spooky.

Just on the redaction point, I did notice one email that looked correctly redacted but when zoomed in you could see some pixels from a few letters had escaped a little. It might be possible to reverse engineer the email just from that.

> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

Maybe I'm underestimating the full extent of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower-DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?
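For a sense of what "just stripping the metadata" involves at the byte level, here is a minimal sketch that walks a JPEG's header segments and drops the ones that usually carry EXIF/XMP (APP1), IPTC (APP13), and comments (COM). It is an illustration under simplifying assumptions, not a vetted sanitizer: it does not handle every legal JPEG layout, and the sample bytes in the usage lines are synthetic.

```python
import struct

# Markers for segments that commonly carry metadata:
# APP1 (EXIF/XMP), APP13 (Photoshop/IPTC), COM (comment).
METADATA_MARKERS = {0xE1, 0xED, 0xFE}

def strip_jpeg_metadata(data):
    """Return JPEG bytes with APP1, APP13, and COM segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows; copy the rest as-is.
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker not in METADATA_MARKERS:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")

# Synthetic example: SOI, a tiny APP0, a tiny APP1 ("Exif"), SOS, fake scan data, EOI.
sample = (
    b"\xff\xd8"
    + b"\xff\xe0" + struct.pack(">H", 4) + b"JF"
    + b"\xff\xe1" + struct.pack(">H", 6) + b"Exif"
    + b"\xff\xda" + struct.pack(">H", 2) + b"scan"
    + b"\xff\xd9"
)
cleaned = strip_jpeg_metadata(sample)
print(len(sample), len(cleaned))
```

Even this only covers three segment types. Other segments (ICC profiles, embedded thumbnails) and the pixel data itself can still carry identifying information, which may be part of why an agency would prefer a blanket rule over per-file stripping.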

  • This is speculation, but generally rules like this follow some sort of incident, e.g., someone responds to a FOI request and accidentally discloses more information than desired due to metadata. So a blanket rule is instituted not to use a particular format.

  • Image metadata is the wild west of structured text. The developer of the foremost tool for dealing with it (exiftool) has made a 'remove metadata' feature but still disclaims that it cannot remove everything.

  • Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper.

    Another guess is that perhaps the step is part of a multi-step sanitization process, and the last step(s) perform the bitmap operation.

    • I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage.

So I have been wondering about this ...

Some of the gathered data is shown here, right? Probably not all.

Now ... that's static information though. That's not really an analysis, most definitely not an independent (open ended) analysis. And it will only show a very incomplete part of the full picture.

This is why I think the "release the files" movement, as good as it is, seems incomplete. I'd rather know a lot more about how they operate their networks and get away with it when underage women are involved. How about the secret services of other countries? Should that not also be highly important? So why is there not a larger investigation as well as independent analysis? Those .pdf files alone cannot show the whole picture. That may be just the tip of the iceberg, and it evidently involves other countries too, with Prince Andrew being the most famous example here (aka the UK, but we already saw that other countries have similar issues, where people suddenly had to step away from politics when it was found out they visited the party locations of Mr. Epstein).

  • It's about showing the public that the whole system is corrupt and the wheels of justice have turned into squares, if not been removed entirely. It's the eulogy of the American justice system and probably of America as a whole. Welcome to the speedy decline.

Love the forensic craft here. Worth noting that the 'recoverable redactions' story that went viral was based on older, unrelated DOJ documents — not the EFTA files, which were properly redacted. The misinformation spread faster than anyone could debunk it. Which is kind of its own forensics problem.

These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.

  • why do we count this in "pages" when it's mostly an email dump

    • Based on my random poking around through the latest datasets for a few hours, while there are a bunch of emails, I don't know if it's "mostly" emails.

      That said, in my opinion they are using "pages" as the metric because it makes the number sound huge.

What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc.

Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld.

  • > What is the legal basis for releasing someone's private files and communications?

    An act of congress, for one.

    Also, AFAIK, federal privacy generally ends at death, as does criminal liability; so releasing government files from a federal investigation after death of the subject is generally within the realm of acceptable conduct.

    • Yes, I forgot about that major part of the story! Still, acts of Congress can't violate Constitutional rights.

      It seems unlikely you lose all rights when you die or it would be chaos - imagine all the secrets people die with that affect everyone they know. An integral part of every estate plan would be incinerating records. Wills do have real power.

  • I'd assume it was the nature of the case, and that discovery was done with him being dead.

  • Given what we've seen so far, there's probably some very interesting stuff in Clinton's private files and communications. Not to mention the stuff in current president Trump's. Some random journalist, probably not. Unless it's a very wealthy and/or connected journalist like David Brooks...

(2025) just follow hn guideline, impressive voter ring though

  • We're in early February 2025 [edit:2026] and the article was written on Dec 23, 2025, which makes it less than two months old. I think it's ok not to include a year in the submission title in that case.

    I personally understand a year in the submission as a warning that the article may not be up to date.

Stylometry works. I've seen it used in cases where the individual was identified from a group.

One thing that is telling about the Epstein case study is how long it has stayed in public view. Pizzagate, which involved more powerful people, was shut down faster than I've ever seen for anything else. I still remember and have archived the more extreme content; it's sick.

  • I probably don’t want to see it, but what kinds of people and activities are in this Pizzagate content?

  • Pizzagate and Epstein are the same thing

    • Pizzagate was 4chan fanfiction. The Epstein files are real enough to have real consequences, although mostly for people outside the US accountability shield.

This is so incredibly useful to me right now, for incidental reasons, that I am commenting to make sure I can get back to it.

  • HN lets you mark submissions (and comments) as favorites, no need to spam the thread.