Comment by amelius
11 hours ago
Google should just turn every webpage into an image and from there OCR it back into information. That's the only way to filter out all the crap that humans will not see.
They've been rendering crawled pages using Chromium for many years now. Hidden text does not work as a ranking manipulation tactic.
Around 2004 they very likely had something along these lines already in place, probably just running it on a small subset suggested by clever heuristics.
Of course, once you start taking the browser apart, you can heavily optimize such a process.
At some point you could even get so frustrated with existing APIs...
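
To make the hidden-text point concrete, here is a minimal sketch of the kind of cheap heuristic one might run before paying for a full render. It is not how Google does it, as far as I know. It only looks at inline `style` attributes and the `hidden` attribute, while a real renderer computes layout from all the CSS and scripts. The names `VisibleTextExtractor`, `visible_text`, and `HIDING_STYLES` are made up for illustration, and it assumes reasonably well-formed HTML with closing tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical heuristic: inline-style fragments that hide an element.
HIDING_STYLES = ("display:none", "visibility:hidden")


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside an obviously hidden subtree."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            # Already hidden: just track nesting so we know when it ends.
            self.hidden_depth += 1
            return
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        if (tag in ("script", "style") or "hidden" in attr_map
                or any(h in style for h in HIDING_STYLES)):
            self.hidden_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would plausibly see, joined by spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ('<p>Buy shoes</p>'
        '<div style="display: none">cheap cheap cheap<span>spam</span></div>'
        '<p>Open 9 to 5</p>')
print(visible_text(page))
```

Anything hidden through an external stylesheet, off-screen positioning, or matching text and background colors slips straight past this, which is exactly why rendering the page for real ends up being the more robust answer.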