Comment by Smerity

5 years ago

In the past I had written about my experiences with crawling[1], from accidentally getting banned by Slashdot as a teenager doing linguistic analysis to accidentally DoS'ing a major website to being threatened with lawsuits.

The latter parts of the story were from when I was part of Common Crawl, a public-good dataset that has seen a great deal of use. During my tenure there I crawled over 2.5 petabytes and 35 billion webpages, mostly by myself.

I'd always felt guilty about a specific case where our crawler hit a big name web company (top N web company) with up to 3000 requests per second* and they sent a lovely note that began with how much they loved the dataset but ended with "please stop thrashing our cache or we'll need to ban your crawler". It was difficult to fix properly due to limited engineering resources, and because they represented many tens or hundreds of thousands of domains, some of which essentially proxied requests back to them.

Knowing Google hammered you at 120k requests per second down to _only_ 20k per second has assuaged some portion of that guilt.

[1]: https://state.smerity.com/smerity/state/01EAN3YGGXN93GFRM8XW...

* Up to 3000 requests per second, as it would spike once every half hour or hour when parallelizing across a new set of URL seeds and then decrease, with the crawl not active for the whole month

With some planning we could have accommodated the 120k rps rate and more, but coming out of the blue it caused a lot of issues: the database shards for historic information tended to be configured for infrequent access to large amounts of historic data, and Google's access pattern completely thrashed our caches. We did want Google to index us. Had there been an open dialog, we could have created a separate path for their traffic that bypassed the cache, and we could have brought additional database servers into production to handle the increased load. We even had a real-time events feed that updated whenever content was created or updated; we would have given Google free access to it so they could crawl just the changes instead of having to scan the site for updates. But since they would not talk to anyone, none of that happened.
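The events-feed idea above can be sketched roughly like this — a minimal, hypothetical illustration (the event shape, field names, and cursor scheme are all assumptions, not the commenter's actual system) of how a crawler polling a change feed would fetch only updated pages instead of re-scanning the site:

```python
# Hypothetical sketch: incremental crawling driven by a change-events
# feed, rather than a full site scan that thrashes the origin's caches.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeEvent:
    url: str
    updated_at: int  # epoch seconds when the content was created/updated

def urls_to_recrawl(feed, last_cursor):
    """Return (urls, new_cursor).

    `feed` is an iterable of ChangeEvent; `last_cursor` is the newest
    timestamp the crawler has already processed. Only pages changed
    since the cursor are returned, deduplicated so a page edited twice
    is fetched once.
    """
    fresh = [e for e in feed if e.updated_at > last_cursor]
    new_cursor = max((e.updated_at for e in fresh), default=last_cursor)
    seen, urls = set(), []
    for e in sorted(fresh, key=lambda e: e.updated_at):
        if e.url not in seen:
            seen.add(e.url)
            urls.append(e.url)
    return urls, new_cursor
```

Each poll then touches only the handful of changed URLs, so the origin's cold-storage shards and caches never see a bulk scan.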