Comment by phendrenad2

2 months ago

So this program really doesn't keep the original image of the document as a raster layer? That's kind of surprising, especially if it's used in the legal world. Personally, I'd always want to be able to recover the original document from the OCR layers. Or are you saying you can? If so, you should tell Snopes, because the Snopes article would be a lot shorter if they could just lead with that.

crazygringo 2 months ago

I think you are misunderstanding. The pipeline is e.g.:

Scan (600 dpi) > MRC (600 dpi) > OCR (600 dpi) > Downsample (150 dpi) > Save to PDF (150 dpi)

The image is saved in raster format at 150 dpi. That's the document, just not at the original scanning resolution. If you ran MRC and OCR on the 150 dpi image, you'd get different (and worse) results than the ones originally produced at 600 dpi. That's why you always OCR before downsampling, and you downsample to get smaller files.
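
To put rough numbers on that, here's a quick sketch in Python. The US Letter page size and the 10 pt font are my assumptions for illustration, and the helper names are made up:

```python
POINTS_PER_INCH = 72  # typographic points per inch

def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions (width, height) of a page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)

def glyph_height_px(font_pt, dpi):
    """Approximate height in pixels of text set at `font_pt` points."""
    return font_pt / POINTS_PER_INCH * dpi

# Assumed page: US Letter, 8.5 x 11 inches.
scan = page_pixels(8.5, 11, 600)    # resolution MRC/OCR was run at
saved = page_pixels(8.5, 11, 150)   # resolution stored in the PDF

# Ratio of total pixel counts between the two resolutions.
pixel_ratio = (scan[0] * scan[1]) / (saved[0] * saved[1])

print(scan, saved, pixel_ratio)
print(glyph_height_px(10, 600), glyph_height_px(10, 150))
```

So the OCR step at 600 dpi has 16 times as many pixels to work with, and 10 pt text is roughly 83 pixels tall instead of roughly 21.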

This doesn't change anything about the Snopes article. It just explains why running MRC/OCR at the PDF's resolution won't deterministically reproduce the original output: that isn't the resolution the processing was originally run at.
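
As a toy illustration of why that is, here's a sketch assuming a plain box-filter downsample (which may not be what the actual software uses; `box_downsample` is a made-up helper). Two different high-resolution patterns collapse to the same low-resolution pixel, so anything run on the downsampled image can't see what the original pass saw:

```python
def box_downsample(img, factor):
    """Average each factor x factor block of a grayscale image.

    `img` is a list of rows of ints (0-255). Assumes both dimensions
    are exact multiples of `factor`.
    """
    height, width = len(img), len(img[0])
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [
                img[yy][xx]
                for yy in range(y, y + factor)
                for xx in range(x, x + factor)
            ]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

# Two clearly different 4x4 patches at "scan" resolution.
vertical = [[0, 255, 0, 255] for _ in range(4)]
horizontal = [[0] * 4 if y % 2 == 0 else [255] * 4 for y in range(4)]

# A 4x reduction, like 600 dpi -> 150 dpi.
low_v = box_downsample(vertical, 4)
low_h = box_downsample(horizontal, 4)

print(low_v, low_h)  # both patches become the same single gray pixel
```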

You do understand that this OCR is only for being able to search and highlight text? It's not changing what's displayed. That's still the pixels.

phendrenad2 2 months ago

I didn't see the original pixels in the document at any resolution though. That's the point.

crazygringo 2 months ago

You don't see the pixels when you zoom in? Try again:

https://obamawhitehouse.archives.gov/sites/default/files/rss...

If you don't see jaggy pixel edges to the letters and form elements, what do you see?