
Comment by zenmac

5 hours ago

Wouldn't it be trivial to just write a ufw rule to block the crawler IPs?

At times like this I'm really glad we self-hosted.
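For reference, the kind of rule being suggested is a one-liner (the addresses below are documentation placeholders); the replies explain why it doesn't scale:

    # Hypothetical ufw rules: fine for one bad source, unmanageable for thousands.
    ufw deny from 203.0.113.42        # block a single crawler IP
    ufw deny from 203.0.113.0/24      # or a whole range, if they stay in one block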

No, there are simply too many. For an e-commerce site I work for, we once had an issue where a bad actor tried to crawl the site to set up scam shops. The list of IPs was far too broad, and the user agents far too generic or random.

  • Could you not also use an ASN list like https://github.com/brianhama/bad-asn-list and add blocks of IPs to a blocklist (e.g. ipset on Linux)? Most of the scripty traffic comes from VPSs. (A rough sketch follows below.)

    • Thanks to widespread botnets, most scrapers fall back to using "residential proxies" the moment you block their cloud addresses. Same load, but now you risk accidentally blocking customers coming from similar net blocks.

      Blocking ASNs is one step of the fight, but unfortunately it's not the solution.
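A rough sketch of the ASN-blocklist suggestion above, assuming each ASN from the list is resolved to prefixes via a route registry (RADB here; the ASN and set name are placeholders):

    # Build an ipset of networks to drop.
    ipset create badasn hash:net -exist

    # Resolve one ASN (AS64496 is a documentation placeholder) to its
    # registered prefixes via RADB and add them to the set.
    whois -h whois.radb.net -- '-i origin AS64496' \
      | awk '/^route:/ {print $2}' \
      | while read -r net; do ipset add badasn "$net" -exist; done

    # Drop anything whose source address matches the set.
    iptables -I INPUT -m set --match-set badasn src -j DROP

As noted above, this only covers the datacenter side of the traffic.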

It's not one IP to block, it's thousands! And they're scattered across different IP networks, so no simple CIDR block is possible. Oh, and just for fun, when you block their datacenter IPs they switch to hundreds of residential network IPs.

Yes, they are really hard to block. In the end I switched to Cloudflare just so they could handle this mess.

Maybe :-)

But for a small operation, AKA just me, it's one more thing for me to get my head around and manage.

I don't run just one website or one service.

It's 100s of sites across multiple platforms!

Not sure I could ever keep up playing AI Crawler and IP Whack-A-Mole!

Wouldn't it be trivial to just write a ufw rule to block the crawler IPs?

Probably more effective would be to get the bots to exclude your IP/domain. I do this for SSH, leaving it open on my public SFTP servers on purpose. [1] If I can get 5 bot owners to exclude me, that could be upwards of 250k+ nodes, mostly mobile IPs, that stop talking to me. Just create something that confuses and craps up the bots. With SSH bots this is trivial, as most SSH bot libraries and code are unmaintained and poorly written to begin with. In my SSH example, look for VersionAddendum: old versions of ssh, old ssh libraries, and code that tries to implement ssh itself will choke on a long banner string. Not to be confused with the text banner file.
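A minimal sshd_config sketch of that idea; the exact string and length used in the linked write-up may differ, this one is only illustrative:

    # /etc/ssh/sshd_config
    # VersionAddendum is appended to the SSH identification banner
    # ("SSH-2.0-OpenSSH_... <addendum>"). Old or hand-rolled client code
    # that expects a short, tidy banner tends to choke on it.
    # This is not the Banner directive, which points to a pre-auth text file.
    VersionAddendum nothing-to-see-here-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Reload sshd afterwards (e.g. systemctl reload sshd; the service name varies by distro) and the new banner is served to every client.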

I'm sure the clever people here could make something similar for HTTPS, and especially for GPT/LLM bots, at the risk of being flagged as "malicious".

[1] - https://mirror.newsdump.org/confuse-some-ssh-bots.html

About 90%+ of bots cannot visit this URL; neither can real people who have disabled HTTP/2.0 in their browser.