Comment by angoragoats
2 days ago
This may be true for large, established crawlers for Google, Bing, et al. I don’t see how you can make this a blanket statement for all crawlers, and my own personal experience tells me this isn’t correct.
These things are so common that having some way of dealing with them is basically mandatory if you plan on doing any sort of large-scale crawling.
That said, crawlers are fairly bug-prone, so misbehaving crawlers are also a relatively common sight. It's genuinely difficult to properly test a crawler, and useless to build it from specs: the realities of the web are so far off the charted territory that any test you build is testing against something far removed from what you'll actually encounter. With real web data, the corner cases have corner cases, and the HTTP and HTML specs are but vague suggestions.
I am aware of all of the things you mention (I've built crawlers before).
My point was only that there are plenty of crawlers that don't operate in the way the parent post described. If you want to call them buggy, that's fine.