Comment by Szpadel

15 hours ago

I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h

but due to the number of IPs involved this did not have any impact on the amount of traffic

my suggestion is to look very closely at the headers that you receive (varnishlog is very nice for this), and if you stare long enough at them you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.)
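To make the "stare at the headers" step less manual, here is a minimal sketch that counts how often each combination of header values shows up. It assumes you have already parsed your logs into dicts of lower-cased header names; the function name, the `country` key, and the default key list are hypothetical choices, not anything varnishlog gives you directly.

```python
from collections import Counter

def top_header_fingerprints(requests,
                            keys=("user-agent", "accept-language", "country"),
                            n=3):
    """Return the n most common combinations of the chosen header values.

    requests: iterable of dicts mapping lower-cased header names to values.
    Missing headers are counted as the empty string, which can itself be
    a useful signal.
    """
    counts = Counter(
        tuple(req.get(k, "") for k in keys) for req in requests
    )
    return counts.most_common(n)
```

A combination that dominates the count but looks odd for real users (say, one outdated browser version paired with a language that does not match the geolocation) is a candidate rule to match on instead of the IP.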

My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:

"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"

https://censorbib.nymity.ch/pdf/Alice2020a.pdf
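A rough sketch of the idea, not the paper's actual method: a machine's TCP timestamp value (TSval) ticks up at a roughly constant rate, so packets from the same machine share the same offset once you subtract elapsed time. The sketch assumes you have captured `(ip, arrival_time_seconds, tsval)` tuples somehow and that the tick rate is known; the function name, the 1000 Hz default, and the tolerance are all made-up assumptions.

```python
from collections import defaultdict

def group_by_tsval_sequence(observations, hz=1000, tolerance=50):
    """Group source IPs whose TCP timestamps lie on the same line.

    observations: iterable of (ip, arrival_time_seconds, tsval).
    hz: assumed TSval tick rate of the senders (an assumption).
    tolerance: bucket width in ticks; offsets are crudely bucketed, so two
    nearby offsets can still land in adjacent buckets.
    """
    groups = defaultdict(set)
    for ip, t, tsval in observations:
        # TSval is a 32-bit counter, so do the subtraction modulo 2**32.
        offset = (tsval - round(t * hz)) % 2**32
        groups[offset // tolerance].add(ip)
    # Largest groups first: many IPs sharing one sequence is the signal.
    return sorted(groups.values(), key=len, reverse=True)
```

If thousands of source IPs collapse into a handful of groups, they are probably not independent hosts.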

If you just block the connection, you send a signal that you are blocking it, and they will change it. You need to impose a cost on every connection through QoS buckets.
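One way to read "QoS buckets" is a token bucket per client group that answers with a delay instead of a refusal. This is a minimal sketch under that assumption; the class and method names are hypothetical, and how you key the buckets (IP, ASN, header fingerprint) is up to you.

```python
import time

class TokenBucket:
    """Token bucket that returns how long to stall a request, never a block."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def delay_for(self, now=None):
        """Charge one request; return seconds to delay it (0.0 if none)."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # Tokens may go negative: the debt grows with every extra request,
        # so a greedy client sees steadily increasing delays.
        return -self.tokens / self.rate
```

The caller would sleep (or queue the response) for the returned number of seconds, so the client still gets served, just slowly.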

If they rotate IPs, ban by ASN: serve a page with some randomized pseudo-looking content in the source (not static), and explain that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).

Have graduated responses starting at a 72-hour ban, where every page thereafter, regardless of URI, results in that page and rate limit. Include a contact email address that is dynamically generated per bucket, and validate that all inbound mail to it passes DMARC for Amazon. Be ready to provide a log of abusive IP addresses.

That way if Amazon wants to take action, they can, but it's in their ballpark. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.
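The graduated response could be tracked with something like the sketch below. Only the 72-hour starting point comes from the comment above; doubling the ban on each repeat offense, the class name, and the example ASN (64500, from the range reserved for documentation) are assumptions. Mapping an IP to its ASN is left out and would need a separate lookup.

```python
from datetime import datetime, timedelta

BASE_BAN = timedelta(hours=72)  # starting ban length from the comment above

class AsnBans:
    """Per-ASN ban tracker with escalating ban lengths."""

    def __init__(self):
        self.strikes = {}       # asn -> number of recorded offenses
        self.banned_until = {}  # asn -> datetime when the ban expires

    def record_abuse(self, asn, now):
        """Record an offense; ban for 72h, doubling each time (assumption)."""
        n = self.strikes.get(asn, 0) + 1
        self.strikes[asn] = n
        until = now + BASE_BAN * 2 ** (n - 1)
        self.banned_until[asn] = until
        return until

    def is_limited(self, asn, now):
        """True while the ASN should get the notice page and rate limit."""
        until = self.banned_until.get(asn)
        return until is not None and now < until
```

While `is_limited` is true, every URI would serve the notice page at the throttled rate.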

Maybe ban ASNs /s

  • This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]

    In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.

    [1] https://news.ycombinator.com/item?id=26865236

Why work hard… Train a model to recognize the AI bots!

  • This isn't a problem domain that models are capable of solving.

    Ultimately in two party communications, computers are mostly constrained by determinism, and the resulting halting/undecidability problems (in core computer science).

    All AI models are really bad at solving stochastic types of problems. They can approximate generally only up to a point, after which accuracy falls off. Temporal consistency in time-series data is also a major weakness. Throw the two together, and models can't really solve it. They can pattern match to a degree, but that is the limit.

    • When all you have is a Markov generator and $5 billion, every problem starts to look like a prompt. Or something like that.