Comment by brotchie

2 months ago

You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:

selinkocalar 2 months ago

As someone who's built an entire business on "anti-screenshots", this is brilliant.

PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
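To make the point above concrete, here is a minimal sketch using a toy, uncompressed content stream. The stream contents and the `extract_shown_text` helper are hypothetical illustrations, not taken from any real document or library. It draws a string and then paints a black rectangle over it. The rectangle is just another drawing instruction, so the string is still sitting in the bytes.

```python
import re

# Toy, uncompressed PDF content stream (hypothetical data, not from a real file).
# First line: show a text string. Second line: fill a black rectangle on top of it.
content_stream = (
    b"BT /F1 12 Tf 72 720 Td (SSN 123-45-6789) Tj ET\n"
    b"0 0 0 rg 70 715 130 16 re f\n"
)


def extract_shown_text(stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    matches = re.findall(rb"\((.*?)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


# The "redacted" text is recovered even though a rectangle covers it on screen.
print(extract_shown_text(content_stream))
```

Real PDFs usually compress their streams and may escape parentheses or use other text operators, so this regex is only an illustration of the principle, not a redaction checker.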

embedding-shape 2 months ago

Not to mention that some PDF editors preserve previous edits in the PDF file itself, which people also seem unaware of. Here is a more user-friendly description of the feature, so you don't have to read the specification itself: https://developers.foxit.com/developer-hub/document/incremen...
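A rough way to spot this, sketched under the assumption that each saved revision ends with its own `%%EOF` marker: count the markers in the raw bytes. The `count_eof_markers` helper and the demo bytes are hypothetical, and this is only a heuristic, since other layouts (linearized files, for example) can also contain more than one marker.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; more than one may indicate appended revisions."""
    return pdf_bytes.count(b"%%EOF")


# Hypothetical stand-ins for file contents; a real check would read the file
# with open(path, "rb").read() instead.
original = b"%PDF-1.7\n...original objects...\n%%EOF\n"
updated = original + b"...objects appended by a later edit...\n%%EOF\n"

print(count_eof_markers(original))  # single revision
print(count_eof_markers(updated))   # earlier revision still present in the bytes
```

If the count is above one, the earlier bytes are still in the file, so anything "removed" in a later edit may be recoverable.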

shbooms 2 months ago

Oftentimes you will have requirements that the documents you release be digitally searchable, and in those cases this would not be an option.

pottertheotter 2 months ago
This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!
- 2ICofafireteam 1 month ago
  
  I have encountered PDFs that would exhibit this behavior in one browser but not in another.
  One fun thing I encountered from local government is releasing files at potato-quality resolution without considering the page size.
  I had an FOI request that returned mainly Arch D-sized drawings, but they were in a 94 DPI PDF rendered as letter-sized. It was a fun conversation trying to explain to an annoyed city employee that putting those large drawings on a 94 DPI letter-size page effectively made them 30-ish DPI.
- eviks 2 months ago
  
  Hostile indeed, and also happens in user-facing documents like product manuals!
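The "30-ish DPI" figure in 2ICofafireteam's reply above can be checked with a quick sketch. It assumes Arch D is 24 x 36 inches, letter is 8.5 x 11 inches, and the drawing is shrunk to fit the page with its aspect ratio preserved. The `effective_dpi` helper is hypothetical.

```python
def effective_dpi(page_dpi: float, original_in: tuple, page_in: tuple) -> float:
    """Resolution relative to the original drawing after shrink-to-fit."""
    # Aspect-preserving fit: the tighter of the two dimensions sets the scale.
    scale = min(page_in[0] / original_in[0], page_in[1] / original_in[1])
    return page_dpi * scale


ARCH_D = (24.0, 36.0)  # inches (assumed portrait orientation)
LETTER = (8.5, 11.0)   # inches

print(round(effective_dpi(94, ARCH_D, LETTER), 1))
```

The long side drives the result: 36 inches squeezed into 11 inches is a scale of about 0.31, so 94 DPI on the page is a little under 29 DPI relative to the original drawing.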
8note 2 months ago
Run some OCR on them afterwards to recreate the text layer?
- albert_e 2 months ago
  
  With the aggressive push of LLMs and generative AI, I am expecting a lot of OCR features to become "smarter" by default, namely to go beyond mechanical OCR and start inserting hallucinations and semantically/contextually "more correct" information into the OCR output.
  It's not hard to imagine some powerful LLMs being able to undo light redactions that are deducible from context.
  