Comment by phendrenad2

2 months ago

Yeah, the claim that supposedly proves it's fraudulent has been debunked, but the alternative hasn't been proven either. Nobody has named the specific OCR software that does this destructive replacement. It's a case of "well, there's an alternative theory, and that's good enough" debunking.

Thorrez 2 months ago

I just took a look at the layers. In some cases, e.g. the 2nd letter in the Local Registrar's signature, a single letter is partially in the background layer, and partially in the upper layer.

This is easily explained by the character separation software being not 100% accurate.

It's not at all explained if someone is fraudulently adding text. Why would someone put half of the character in 1 layer and half of the character in a different layer?

crazygringo 2 months ago

Not sure what you mean by destructive replacement, since nothing is destroyed.

So I just looked into this, and it's specifically the Mixed Raster Content (MRC) pipeline (ISO/IEC 16485), which is used in lots of different scanners. There's no need to find which specific software generated it, because lots of them use it.

It's a technique used to attempt to isolate font characters of the same size and style as separate layers before OCR-ing to make OCR more accurate.

ABBYY FineReader, for example, is mentioned as producing the exact same type of results. But there's no guarantee that was the actual software because lots of scanning software does it -- it's a general technique. Plus it won't even be deterministically reproducible if it was e.g. scanned and OCR'd at higher resolution and then saved at a lower resolution, as is generally considered best practice for maximizing accuracy while keeping file sizes lower.

https://www.obamaconspiracy.org/2013/01/heres-the-birth-cert...

So this is very much a nothingburger. It's not an "alternative theory", it's a complete and total explanation.

phendrenad2 2 months ago

So this program really doesn't keep the original image of the document as a raster layer? That's kind of surprising, especially if it's used in the legal world. Personally, I'd always want to be able to recover the original document from the OCR layers. Or are you saying you can? Then you should tell Snopes, because it'll make the Snopes article a lot shorter if they can just lead with that.
- crazygringo 2 months ago
  
  I think you are misunderstanding. The pipeline is e.g.:
  Scan (600 dpi) > MRC (600 dpi) > OCR (600 dpi) > Downsample (150 dpi) > Save to PDF (150 dpi)
  The image is saved in raster format at 150 dpi. That's the document, but not at the original scanning resolution. If you performed MRC and OCR at the 150 dpi level, you'd get different (and worse) results than the original 600 dpi run produced. That's why you always OCR before downsampling, and you downsample to get smaller files.
  This isn't changing anything about the Snopes article. It just explains why if you run MRC/OCR at the PDF resolution, you won't deterministically reproduce it because it's not the resolution it was originally run at.
  You do understand that this OCR is only for being able to search and highlight text? It's not changing what's displayed. That's still the pixels.
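
To make the ordering point in the pipeline above concrete, here is a toy sketch in Python. It is not any real scanner's MRC or OCR code: a plain brightness threshold stands in for layer separation, block averaging stands in for the 600 to 150 dpi downsample (a factor of 4), and the names `segment`, `downsample`, and `scan` are hypothetical, made up for illustration only.

```python
# Toy illustration only. A brightness threshold stands in for MRC layer
# separation; no real scanner software is modeled here.

def segment(pixels, threshold=128):
    """Mark a pixel as foreground (1) if it is darker than threshold, else 0."""
    return [[1 if p < threshold else 0 for p in row] for row in pixels]

def downsample(pixels, factor):
    """Average non-overlapping factor x factor blocks (dimensions must divide evenly)."""
    out = []
    for y in range(0, len(pixels), factor):
        row = []
        for x in range(0, len(pixels[0]), factor):
            block = [pixels[yy][xx]
                     for yy in range(y, y + factor)
                     for xx in range(x, x + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

FACTOR = 4  # 600 dpi -> 150 dpi

# A 4x8 "high-resolution scan": one thin black stroke (0) on white paper (255).
scan = [[255, 0, 255, 255, 255, 255, 255, 255] for _ in range(4)]

# Original order: segment at high resolution, then shrink the mask,
# keeping a block as foreground if any part of it was foreground.
mask_hi = segment(scan)
mask_then_down = [[1 if v > 0 else 0 for v in row]
                  for row in downsample(mask_hi, FACTOR)]

# Re-run order: shrink the image first, then segment the blurred result.
down_then_mask = segment(downsample(scan, FACTOR))

print(mask_then_down)  # [[1, 0]]
print(down_then_mask)  # [[0, 0]]
```

In this toy case the thin stroke survives when separation runs at full resolution, but it is averaged into the paper color when the image is shrunk first, so re-running the separation on the saved low-resolution image gives a different layer split.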
  