Comment by chlorion
1 day ago
I self-host a small static website and a cgit instance on an e2-micro VPS from Google Cloud, and I've gotten around 8.5 million requests combined from OpenAI and Claude over roughly 160 days. They just crawl the cgit pages forever unless I block them!
(1) root@gentoo-server ~ # egrep 'openai|claude' -c /var/log/lighttpd/access.log
8537094
So I have lighttpd set up to match "claude|openai" in the user agent string and return a 403 if it matches, plus an nftables firewall set up to rate limit spammers, and this seems to help a lot.
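Roughly, the lighttpd side is just a user-agent conditional like this (a sketch, not my exact config; adjust the regex to taste):

# lighttpd.conf: deny requests whose User-Agent matches claude or openai
# (case-insensitive); url.access-deny answers with a 403
$HTTP["useragent"] =~ "(?i)(claude|openai)" {
    url.access-deny = ( "" )
}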
And those are the good actors! We're under a crawlpocalypse from botnets, er, residential proxies.
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", anyone?
Yeah, the flood of these Chrome UAs with every version number under the sun, a really large portion of them being *.0.0.0 versions, is what I've tended to see lately. Also just about every browser user agent ever:
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; .NET CLR 3.5.21022)
There were waves of big and sometimes intrusive traffic admitting to being from Amazon, Anthropic, Google, Meta, etc., but those are easy to block or throttle and aren't that big a deal in the scheme of things.
It’s unfortunate that you have to resort to this. OpenAI does publish their bot IP addresses at https://platform.openai.com/docs/bots, but Anthropic doesn’t seem to publish the IP addresses of their bots.
The third-party hit-counting service I use implies that I'm not getting any of this bot scraping on my GitHub blog.
Is Microsoft doing something to prevent it? Or am I so uncool that even bots don't want to read my content :(
I'm interested in that service and how it works. Link?
It is https://github.com/silentsoft/hits . It works by loading an SVG "shield" file (like the ones you see at the top of GitHub READMEs all the time) from their server at a unique URL that you choose when you write/render your HTML. The server, implemented in Java, just counts hits to each URL in a database and sends back the corresponding SVG data. There's also a mini dashboard website where you can check basic stats for a given URL (no login required; everyone's hits-per-day stats are just public) and preview styling options for the SVG.

For example, for my most recent blog post https://zahlman.github.io/posts/2025/12/31/oxidation/, I configured it such that you can view the stats via https://hits.sh/zahlman.github.io+oxidation/ (note that the trailing slash is required).
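(Embedding it in a page is presumably just an image tag pointing at the per-URL badge; if hits.sh follows its usual convention of appending .svg to the counter path, something like this would do it:

<!-- hypothetical embed; the exact badge URL and styling options can be previewed on the dashboard -->
<img src="https://hits.sh/zahlman.github.io+oxidation.svg" alt="Hits" />

Every page load then registers one hit on the server for that path.)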
(The about section on GitHub bills the project as "privacy-friendly", which I would say is nonsense as these dashboards are public and their URLs are trivially computed. But it's also hard to imagine caring.)