Comment by Szpadel

1 year ago

I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h.

But due to the number of IPs involved, this did not have any impact on the amount of traffic.

My suggestion is to look very closely at the headers that you receive (varnishlog is very nice for this). If you stare at them long enough, you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.).
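
A minimal sketch of that "stare at the headers" step, assuming the request headers have already been parsed out of the logs into plain dicts. The field list and function names here are hypothetical and only for illustration; the idea is just to count which header combinations dominate the traffic.

```python
from collections import Counter

# Hypothetical choice of fields; use whichever headers your logs expose.
FINGERPRINT_FIELDS = ("user-agent", "accept-language", "accept-encoding")


def fingerprint(headers):
    """Reduce one request's headers to a tuple of the chosen field values."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Missing headers become "" so their absence is part of the fingerprint.
    return tuple(lowered.get(field, "") for field in FINGERPRINT_FIELDS)


def top_fingerprints(requests, n=5):
    """Return the n most common fingerprints with their share of all requests."""
    counts = Counter(fingerprint(headers) for headers in requests)
    total = sum(counts.values())
    return [(fp, count / total) for fp, count in counts.most_common(n)]
```

A fingerprint that covers a large share of requests but looks nothing like your normal visitors is a candidate for a blocking or rate-limiting rule.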

14 comments


conradev 1 year ago

My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:

"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"

https://censorbib.nymity.ch/pdf/Alice2020a.pdf
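
A rough sketch of the idea in the quote, not the paper's actual method: if a machine's TCP timestamp counter ticks at a steady rate, then `tsval - rate * time` is roughly constant for every packet it sends, whatever source IP it uses. The tick rate, the tolerance, and the function name below are assumptions for illustration.

```python
from collections import defaultdict


def cluster_by_tsval(observations, rate_hz=1000, tolerance_ticks=50):
    """Group source IPs whose TCP timestamps sit on the same linear sequence.

    observations: iterable of (unix_time_seconds, tsval, src_ip).
    rate_hz: assumed tick rate of the sender's timestamp clock.
    """
    clusters = defaultdict(set)
    for seen_at, tsval, src_ip in observations:
        # Offset of this packet's sequence; TSval is a 32-bit counter.
        offset = (tsval - round(seen_at * rate_hz)) % 2**32
        # Coarse bucketing: offsets near a bucket edge can be split apart.
        clusters[offset // tolerance_ticks].add(src_ip)
    # Only buckets shared by more than one IP are interesting.
    return [ips for ips in clusters.values() if len(ips) > 1]
```

In practice the tick rate is unknown and would have to be estimated per sequence, so treat this as a starting point only.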

rnewme 1 year ago

This is a very cool read, thanks

trod1234 1 year ago

If you just block the connection, you send a signal that you are blocking it, and they will change it. You need to impose a cost on every connection through QoS buckets.

If they rotate IPs, ban by ASN: serve a page with some randomized, pseudo-real-looking content in the source (not static) that explains that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).
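
One way to read "QoS buckets" keyed by ASN is a token bucket per ASN. This is a sketch under that assumption; the class names, rates, and the injectable clock are hypothetical, and the ASNs in the usage note are from the range reserved for documentation.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at rate_per_sec, holds at most burst tokens."""

    def __init__(self, rate_per_sec, burst, now=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.now = now
        self.tokens = float(burst)
        self.updated = now()

    def allow(self):
        current = self.now()
        elapsed = current - self.updated
        self.updated = current
        # Refill for the elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class AsnLimiter:
    """Keeps one TokenBucket per ASN, so rotating IPs inside an ASN does not help."""

    def __init__(self, rate_per_sec, burst, now=time.monotonic):
        self._args = (rate_per_sec, burst, now)
        self._buckets = {}

    def allow(self, asn):
        bucket = self._buckets.get(asn)
        if bucket is None:
            bucket = self._buckets[asn] = TokenBucket(*self._args)
        return bucket.allow()
```

With `AsnLimiter(1.0, 2)`, ASN 64500 gets two requests immediately and then one per second, while ASN 64501 has its own separate budget. Requests that are refused would get the rate-limit page instead of real content.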

Have graduated responses, starting at a 72-hour ban, where every page thereafter, regardless of URI, results in that page and the rate limit. Include a contact email address that is dynamically generated per bucket, and validate that all inbound mail to it passes DMARC for Amazon. Be ready to provide a log of abusive IP addresses.
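
A tiny sketch of what "graduated" could look like. Only the 72-hour starting point comes from the comment; the doubling schedule and the 30-day cap are assumptions picked for illustration.

```python
from datetime import timedelta

BASE_BAN = timedelta(hours=72)  # starting point named in the comment
MAX_BAN = timedelta(days=30)    # hypothetical cap, not from the comment


def ban_duration(prior_offenses):
    """Double the ban for each repeat offense (an assumed schedule), capped at MAX_BAN."""
    return min(BASE_BAN * (2 ** prior_offenses), MAX_BAN)
```

So a first offense gets 72 hours, a second gets 144 hours, and repeat offenders eventually sit at the cap.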

That way, if Amazon wants to take action, they can, but the ball is in their court. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.

dredmorbius 1 year ago

It's possible to ban at finer granularity, specifically CIDR blocks, using the Routeviews project reverse-DNS lookup:
<https://www.routeviews.org/routeviews/>
That also provides the associated AS, enabling blocking at that level as well, if warranted.
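
A sketch of how such a lookup could feed a CIDR blocklist, with no network call in the code itself. The zone name `asn.routeviews.org` and the three-string TXT layout (ASN, prefix, prefix length) are assumptions here and should be checked against the Routeviews documentation; the function names are hypothetical.

```python
import ipaddress

ZONE = "asn.routeviews.org"  # assumed zone name; verify against the Routeviews docs


def query_name(ip):
    """Build the reversed-octet DNS name to query (as TXT) for an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + ZONE


def parse_txt(strings):
    """Turn an assumed (asn, prefix, length) TXT answer into (ASN, network)."""
    asn, prefix, length = strings
    return int(asn), ipaddress.ip_network(f"{prefix}/{length}")


def is_blocked(ip, blocked_networks):
    """True if ip falls inside any of the blocked CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked_networks)
```

The TXT query itself can be done with whatever resolver you already use; the parsed network goes into the blocklist, and the ASN is there if you decide to block at that level instead.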

aaomidi 1 year ago

Maybe ban ASNs /s

koito17 1 year ago
This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]
In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.
[1] https://news.ycombinator.com/item?id=26865236
- pixl97 1 year ago
  
  Ya if it's also coming from residences it's probably some kind of botnet

superjan 1 year ago

Why work hard… Train a model to recognize the AI bots!

js4ever 1 year ago
Because you have to decide in less than 1 ms; using AI is too slow in that context.
- Dylan16807 1 year ago
  
  You can delay the first request from an IP by a lot more than that without causing problems.
- franktankbank 1 year ago
  
  Train with a BDT.
trod1234 1 year ago
This isn't a problem domain that models are capable of solving.
Ultimately, in two-party communications, computers are mostly constrained by determinism and the resulting halting/undecidability problems (in core computer science).
All AI models are really bad at solving stochastic types of problems. They can approximate generally only up to a point, after which they fall off. Temporal consistency in time series data is also a major weakness. Throw the two together, and models can't really solve it. They can pattern match to a degree, but that is the limit.
- seethenerdz 1 year ago
  
  When all you have is a Markov generator and $5 billion, every problem starts to look like a prompt. Or something like that.