Comment by A1kmm

8 hours ago

If you use nginx to front it, consider something like this in the `http` block of your config:

    # Case-insensitive (~*) UA matching; the empty default means
    # unmatched (presumably human) traffic is never rate limited.
    map $http_user_agent $bottype {
        default           "";
        "~*amazonbot"     "amazon";
        "~*imagesiftbot"  "imagesift";
        "~*googlebot"     "google";
        "~*claudebot"     "claude";
        "~*gptbot"        "gpt";
        "~*semrush"       "semrush";
        "~*mj12"          "mj12";
        "~*bytespider"    "bytedance";
        "~*facebook"      "facebook";
    }
    # One shared bucket per bot family, roughly 6 requests/minute each.
    limit_req_zone $bottype zone=bots:10m rate=6r/m;
    limit_req zone=bots burst=10 nodelay;
    limit_req_status 429;

You can still layer other limits keyed by IP on top of this. 429s tend to slow the scrapers down, and they mean you spend a lot less on bandwidth and compute when they get too aggressive. Monitor your logs and adjust the regex list over time as needed.
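For example, here is a minimal sketch of an extra per-IP limit layered on top of the bot zone; the zone name, size, rate, and burst values are placeholders to tune for your own traffic:

    # Separate zone keyed by client IP, declared in the http block.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=30r/m;

    server {
        # ...
        location / {
            # Both zones apply to a matching request; the stricter one
            # effectively wins for bot traffic.
            limit_req zone=perip burst=20 nodelay;
            limit_req zone=bots  burst=10 nodelay;
        }
    }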

Note that if SEO is a goal, this does make you vulnerable to blackhat SEO: someone can fake the UA of a search engine you care about and burn its 6 req/minute quota with fake bots. You could treat Google differently.

This approach also won't cover the case where the UA is dishonest and pretends to be a browser. That is an especially hard problem if the scraper has a large pool of residential IPs and runs real or headless browsers, but it's a whole different problem that needs different solutions.

For Google, just read their publicly published list of crawler IPs. It's broken down into three JSON files by category: one set of IPs is for GoogleBot (the web crawler), one is for special requests like those from Google Search Console, and one is for special crawlers related to things like Google Ads.

You can ingest this IP list periodically and set rules based on those IPs instead. That makes you immune to the blackhat SEO tactic you mentioned. In fact, you could completely block GoogleBot UA strings that don't match the IPs, without harming SEO, since those UA strings are being spoofed ;)
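A rough nginx sketch of that idea, assuming some periodic job (cron or similar, not shown) has already rendered Google's published googlebot.json ranges into the geo block below. The geo and map go in the http block alongside the $bottype map above; the CIDR shown is an example entry, not the authoritative list:

    # Regenerate these ranges periodically from Google's published
    # googlebot.json; the CIDR below is an example entry only.
    geo $ip_is_googlebot {
        default        0;
        66.249.64.0/19 1;
    }

    # UA claims Googlebot (per the $bottype map above) but the source IP
    # is not in the published ranges: almost certainly spoofed.
    map "$bottype:$ip_is_googlebot" $fake_googlebot {
        default    0;
        "google:0" 1;
    }

    server {
        # ...
        location / {
            if ($fake_googlebot) {
                return 403;
            }
        }
    }

Keeping the ranges in their own geo block means a plain reload after each refresh picks up the new list without touching the rest of the config.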