Comment by loeg
19 days ago
I'd kind of like to see that claim substantiated a little more. Is it all crawlers that switch to a non-bot UA, or how are they determining it's the same bot? What non-bot UA do they claim?
> Is it all crawlers that switch to a non-bot UA
I've observed only one of them do this with high confidence.
> how are they determining it's the same bot?
It's fairly easy to determine that it's the same bot: as soon as I blocked the "official" one, a bunch of AWS IPs started crawling the same URL patterns, in this case MediaWiki's diff view (`/wiki/index.php?title=[page]&diff=[new-id]&oldid=[old-id]`), which absolutely no bot had ever crawled before.
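A minimal sketch of how that kind of correlation could be checked from an access log, not the commenter's actual method. It assumes the server writes the common "combined" log format; `LOG_RE`, `is_diff_view`, and `diff_hits_by_client` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumption: "combined" access-log format (IP, timestamp, request line,
# status, size, referrer, user agent). Adjust for the real server config.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def is_diff_view(target):
    """True if the request target looks like a MediaWiki diff view."""
    parts = urlsplit(target)
    if not parts.path.endswith("/index.php"):
        return False
    query = parse_qs(parts.query)
    return "diff" in query and "oldid" in query

def diff_hits_by_client(lines):
    """Count diff-view requests per (IP, user agent) pair."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and is_diff_view(match.group("target")):
            hits[(match.group("ip"), match.group("ua"))] += 1
    return hits
```

Comparing the counts before and after blocking the declared crawler would show whether the same URL pattern reappears under a browser-like user agent from a different address range.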
> What non-bot UA do they claim?
Latest Chrome on Windows.
Thanks.
Presumably they switch UA to Mozilla/something but tell on themselves by still using the same IP range or ASN. Unfortunately this has become common practice for feed readers as well.
I would take anything the author said with a grain of salt. They straight up lied about the configuration of the robots.txt file.
https://news.ycombinator.com/item?id=42551628
How do you know what the contextual configuration of their robots.txt is/was?
Your accusation was directly addressed by the author in a comment on the original post, IIRC
I find your attitude as expressed here to be problematic in many ways.
CommonCrawl archives robots.txt
For convenience, you can view the extracted data here:
https://pastebin.com/VSHMTThJ
You are welcome to verify for yourself by searching for “wiki.diasporafoundation.org/robots.txt” in the CommonCrawl index here:
https://index.commoncrawl.org/
The index contains a file name that you can append to the CommonCrawl url to download the archive and view.
More detailed information on downloading archives here:
https://commoncrawl.org/get-started
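The lookup described above can be sketched roughly as follows. This assumes the index server's JSON output, where each result line carries `filename`, `offset`, and `length` fields, and that archives are served from `data.commoncrawl.org`. The helper names are hypothetical, and the crawl ID has to be chosen from the list on the index page.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"  # assumed download host

def index_query_url(crawl_id, page_url):
    """Build the index query URL for one crawl and one page URL."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"

def archive_request(index_line):
    """Turn one JSON line from the index into (download URL, headers).

    The Range header selects only the bytes of the matching record
    instead of the whole archive file.
    """
    record = json.loads(index_line)
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    url = f"{DATA_HOST}/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}
```

Fetching the returned URL with the returned header should yield a gzip-compressed record containing the archived robots.txt response.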
From September to December, the robots.txt at wiki.diasporafoundation.org contained this, and only this:
> User-agent: *
> Disallow: /w/
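To see what those two lines permit, here is a small sketch using Python's standard-library `urllib.robotparser`. It only shows how that one parser reads the rules. Real crawlers may match differently, and `ExampleBot` and the `allowed` helper are made-up names.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, fed directly to the parser (no network access).
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /w/"])

def allowed(path, agent="ExampleBot"):
    """Ask the parser whether `agent` may fetch `path` on the wiki host."""
    return rules.can_fetch(agent, "https://wiki.diasporafoundation.org" + path)

# Disallow rules are prefix matches, so /w/... is blocked while /wiki/... is not.
print(allowed("/w/index.php?title=Foo"))
print(allowed("/wiki/index.php?title=Foo&diff=2&oldid=1"))
```

Under this parser, the diff-view URLs mentioned earlier in the thread (`/wiki/index.php?...`) do not fall under the `/w/` prefix.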
Apologies for my attitude, I find defenders of the dishonest in the face of clear evidence even more problematic.
What is causing you to be so unnecessarily aggressive?
Liars should be called out, necessarily. Intellectual dishonesty is cancer. I could be more aggressive if it were something that really mattered.