Comment by crazygringo
2 months ago
I think you are misunderstanding. The pipeline is e.g.:
Scan (600 dpi) > MRC (600 dpi) > OCR (600 dpi) > Downsample (150 dpi) > Save to PDF (150 dpi)
The image is saved in raster format at 150 dpi. That's the document, just not at the original scanning resolution. If you performed MRC and OCR on the 150 dpi image, you'd get different (and worse) results than the ones originally produced at 600 dpi. That's why you always OCR before downsampling, and you downsample to get smaller files.
This isn't changing anything about the Snopes article. It just explains why, if you run MRC/OCR at the PDF's resolution, you won't deterministically reproduce the original output: that's not the resolution it was originally run at.
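If it helps, here's a toy sketch of why the order matters. It is not real MRC or OCR: a 1-D "scanline" stands in for the page, counting dark runs stands in for the analysis step, and every name in it (`downsample`, `count_strokes`, `scan_600`) is made up for illustration. The only point is that running the same analysis after a 4x downsample (600 dpi to 150 dpi) can give a different answer than running it before.

```python
# Toy illustration only: a 1-D "scanline" stands in for a scanned page, and
# counting dark runs stands in for the analysis step (MRC/OCR).

def downsample(row, factor):
    """Box-average every `factor` samples into one (e.g. 600 dpi -> 150 dpi)."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]

def count_strokes(row, threshold=0.5):
    """Binarize at `threshold` and count runs of dark samples."""
    strokes, inside = 0, False
    for value in row:
        dark = value > threshold
        if dark and not inside:
            strokes += 1
        inside = dark
    return strokes

# Hypothetical scanline at "600 dpi": 1 = ink, 0 = paper.
# Two thin strokes separated by a one-sample gap, then one thick stroke.
scan_600 = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

# Analysis run before downsampling (what the pipeline above does).
strokes_at_600 = count_strokes(scan_600)

# Analysis run after downsampling (what you get from the saved 150 dpi image).
scan_150 = downsample(scan_600, 4)
strokes_at_150 = count_strokes(scan_150)

print(strokes_at_600, strokes_at_150)
```

In this made-up case the two thin strokes get averaged down to the threshold and drop out, so the full-resolution pass and the downsampled pass disagree. Real MRC/OCR is far more complex, but the same kind of information loss is why rerunning it on the 150 dpi file won't match the original run.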
You do understand that this OCR is only for being able to search and highlight text? It's not changing what's displayed. That's still the pixels.
I didn't see the original pixels in the document at any resolution though. That's the point.
You don't see the pixels when you zoom in? Try again:
https://obamawhitehouse.archives.gov/sites/default/files/rss...
If you don't see jaggy pixel edges to the letters and form elements, what do you see?
Are you saying that I'm saying that there are no pixels in the document? Like, do you think that I think that scanners have come to operate on pure platonic forms and no longer use the concept of pixels? That would be really cool, wouldn't it? But no, I don't believe that. Hm. Where did this conversation go wrong? I think I was unclear in my last statement. I have yet to see anyone show the original scribbles or ink marks that these OCR layers were generated from. That's what I meant by "destructive". Now, I'm no expert on documents, so you might want to just cut your losses, stop trying to educate me, and let me stay uneducated in this matter. I'll accept that I don't know what I'm talking about, and reduce my criticisms of this whole thing to pointing out that the explanations don't make sense to me.