Comment by jsheard
14 hours ago
> I figured that they have found an (automated) way to imitate Googlebot really well.
If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
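That "DNS lookup dance" is a reverse lookup on the claimed IP followed by a forward lookup on the resulting hostname, and checking the two agree. A minimal sketch in Python (the function name is mine; the domain suffixes are the ones Google documents for Googlebot):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot request IP via the reverse/forward DNS dance:
    1) reverse-resolve the IP to a hostname,
    2) require the hostname to end in googlebot.com or google.com,
    3) forward-resolve that hostname and confirm it maps back to the same IP.
    A spoofer can fake the User-Agent, but not this round trip."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
    except (socket.herror, socket.gaierror, OSError):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward DNS
    except (socket.gaierror, OSError):
        return False
    return ip in forward_ips
```

Any IP that doesn't reverse-resolve into Google's domains (e.g. a private address) fails immediately, regardless of what User-Agent the request carried.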
There are ways around this, though. I just tested one: I used the URL inspection tool in Google Search Console to fetch a URL on my website, which I'd configured to redirect to a paywalled news article. It turns out the crawler follows that redirect and hands me the full source of the redirected page, without any paywall.
That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along these lines. It's a perfect imitation of Googlebot because it literally is Googlebot.
I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P
Presumably they are just matching on *Google* and calling it a day.
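To illustrate why that kind of substring match is the failure mode here: a check that keys on "Google" anywhere in the User-Agent will also wave through Google-InspectionTool (and anything else that mentions Google). The UA strings below are illustrative, shortened versions of the real ones:

```python
# Naive WAF-style check: any UA containing "Google" gets the Googlebot treatment.
def naive_is_googlebot(user_agent: str) -> bool:
    return "Google" in user_agent

googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
inspection_ua = "Mozilla/5.0 (compatible; Google-InspectionTool/1.0)"

print(naive_is_googlebot(googlebot_ua))   # intended match
print(naive_is_googlebot(inspection_ua))  # unintended: InspectionTool slips through too
```

The IP-verification dance above doesn't have this problem, since it never trusts the User-Agent at all.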
Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.
> which I've configured to redirect to a paywalled news article.
Which specific site with a paywall?