Comment by josephg
1 day ago
> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.
So? What duty do web site operators have to be "nice" to people scraping your website?
The Marginalia search engine or archive.org probably don't deserve such treatment--they're performing a public service that benefits everyone, for free. And it's generally not in one's best interests to serve a bunch of garbage to Google or Bing's crawlers, either.
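Operators who want to spare those crawlers can, for example, gate the trap on the User-Agent header. A minimal sketch (the agent substrings and function name are illustrative assumptions; note User-Agent is trivially spoofed, so real deployments typically also verify crawlers via reverse DNS):

```python
# Substrings that appear in the user agents of well-known
# public-service crawlers (illustrative, not exhaustive).
POLITE_AGENTS = ("archive.org_bot", "search.marginalia.nu",
                 "Googlebot", "bingbot")

def should_trap(user_agent: str) -> bool:
    """Route only unrecognized agents into the tarpit; known
    public-service crawlers get the real content."""
    ua = user_agent or ""
    return not any(agent in ua for agent in POLITE_AGENTS)
```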
It's not really too big of a problem for a well-implemented crawler. You basically need to put an upper bound on both document count and wall-clock time for each crawl, since crawler traps are pretty common and have been around since the Cretaceous.
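A minimal sketch of what those bounds might look like (the limits, `fetch`, and `extract_links` are illustrative assumptions, not any real crawler's internals):

```python
import time
from collections import deque

def bounded_crawl(seed_url, fetch, extract_links,
                  max_docs=10_000, max_seconds=3600):
    """Breadth-first crawl with hard caps on document count and
    wall-clock time, so a crawler trap can only waste a bounded
    amount of work."""
    deadline = time.monotonic() + max_seconds
    seen = {seed_url}
    queue = deque([seed_url])
    docs = []
    while queue and len(docs) < max_docs and time.monotonic() < deadline:
        url = queue.popleft()
        body = fetch(url)  # caller-supplied HTTP fetch
        docs.append((url, body))
        for link in extract_links(body):  # caller-supplied parser
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs
```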
If you have such a website, then you will just serve normal data. But it seems perfectly legit to serve fake random gibberish from your website if you want to. A human would just stop reading it.
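For what it's worth, an "infinite page zoo" of the kind described upthread is only a few lines of code. A hedged sketch using Flask (the route, word list, and page sizes are made up for illustration): each page is deterministic gibberish that links only to more gibberish.

```python
import random
from flask import Flask  # pip install flask

app = Flask(__name__)
WORDS = ["ocelot", "quartz", "meander", "lattice", "fjord", "tumble"]

@app.route("/zoo/<page_id>")
def zoo(page_id):
    # Seed the RNG with the URL so revisits return the same
    # "page", which makes the trap look like real content.
    rng = random.Random(page_id)
    text = " ".join(rng.choice(WORDS) for _ in range(200))
    links = "".join(
        f'<a href="/zoo/{rng.getrandbits(64):x}">more</a> '
        for _ in range(5)
    )
    return f"<html><body><p>{text}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run()
```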
The point is that not every web crawler is out there to scrape websites.
Unless you define "scrape" to be inherently nefarious - then surely they are? Isn't the definition of a web crawler based on scraping websites?
I think that web scraping is usually understood as the act of extracting information from a website for ulterior, self-centered motives. However, it is clear that this ulterior motive cannot be assessed by a website owner. Only the observable behaviour of a data-collecting process can be categorized as morally good or bad. While the badly behaved people are usually also the ones with morally wrong motives, one doesn't entail the other. I'd call the badly behaved ones scrapers, and the well-behaved ones crawlers.
That being said, the author is perhaps concerned by the growing number of collection processes, which take a toll on his server, and thus chose to simply penalize them all.