Comment by armchairhacker
9 months ago
I like the solution in this comment: https://news.ycombinator.com/item?id=42727510.
Put a link somewhere in your site that no human would visit, disallow it in robots.txt (under a wildcard because apparently OpenAI’s crawler specifically ignores wildcards), and when an IP address visits the link ban it for 24 hours.
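For anyone who wants to try this, here is a minimal sketch of the ban logic, not the linked commenter's actual setup. The path `/trap/do-not-follow`, the class name `HoneypotBanList`, and the in-memory dict are all assumptions made up for illustration; a real deployment would more likely do this in the web server or firewall.

```python
import time

BAN_SECONDS = 24 * 60 * 60  # 24-hour ban, as suggested above
# Hypothetical trap path; it would also be listed under Disallow in robots.txt.
HONEYPOT_PATH = "/trap/do-not-follow"


class HoneypotBanList:
    """Remembers which IPs requested the trap link and when each ban expires."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.time):
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self._banned_until = {}

    def is_banned(self, ip):
        expires = self._banned_until.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            # Ban has run out; forget the IP.
            del self._banned_until[ip]
            return False
        return True

    def handle_request(self, ip, path):
        """Return the HTTP status code to send for this request."""
        if self.is_banned(ip):
            return 403
        if path == HONEYPOT_PATH:
            # Only a client ignoring robots.txt should ever land here.
            self._banned_until[ip] = self.clock() + self.ban_seconds
            return 403
        return 200
```

Well-behaved crawlers never request the disallowed path, so they are unaffected; anything that does request it is locked out until the ban expires.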
I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h,
but due to the number of IPs involved this did not have any impact on the amount of traffic.
My suggestion is to look very closely at the headers that you receive (varnishlog is very nice for this). If you stare long enough at them, you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.).
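The staring can be partly automated. A rough sketch, assuming you have already parsed your logs into (ip, headers) pairs: the header selection in `FINGERPRINT_HEADERS` and the `min_ips` threshold are arbitrary choices for illustration, not something from varnishlog itself.

```python
from collections import Counter

# Hypothetical choice of headers to combine into a fingerprint.
FINGERPRINT_HEADERS = ("user-agent", "accept-language", "accept-encoding")


def fingerprint(headers):
    """Build a tuple of the selected header values (missing headers become '')."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return tuple(lowered.get(name, "") for name in FINGERPRINT_HEADERS)


def common_fingerprints(requests, min_ips=3):
    """Return (fingerprint, distinct_ip_count) pairs, most widespread first.

    `requests` is an iterable of (ip, headers_dict) pairs. Only fingerprints
    seen from at least `min_ips` distinct IPs are returned.
    """
    ips_by_fp = {}
    for ip, headers in requests:
        ips_by_fp.setdefault(fingerprint(headers), set()).add(ip)
    counts = Counter({fp: len(ips) for fp, ips in ips_by_fp.items()})
    return [(fp, n) for fp, n in counts.most_common() if n >= min_ips]
```

A fingerprint shared by many unrelated IPs is only a hint. Popular browsers will show up here too, so the output still needs a human eye before anything gets blocked.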
My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:
"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"
https://censorbib.nymity.ch/pdf/Alice2020a.pdf
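To make the idea concrete, here is a toy sketch of grouping source IPs by shared TCP timestamp sequence. It is not the paper's method. It assumes each machine's TSval counter ticks at a known constant rate (`hz`) from a machine-specific starting offset, and that you have already captured (ip, seen_at_seconds, tsval) triples; the function name, the 1000 Hz default, and the bucketing tolerance are all made up for illustration.

```python
from collections import defaultdict


def group_by_timestamp_sequence(observations, hz=1000, tolerance=100):
    """Group IPs whose TCP timestamps appear to come from the same counter.

    `observations` is an iterable of (ip, seen_at_seconds, tsval) triples.
    If a counter ticks at `hz`, then tsval - hz * seen_at is roughly constant
    for one machine, so IPs sharing that offset likely share a machine.
    Returns a list of IP sets, largest group first.
    """
    groups = defaultdict(set)
    for ip, seen_at, tsval in observations:
        # TSval is a 32-bit counter, so do the subtraction modulo 2**32.
        offset = (tsval - round(hz * seen_at)) % 2**32
        # Coarse bucketing; offsets near a bucket edge may be split apart.
        groups[offset // tolerance].add(ip)
    return sorted(groups.values(), key=len, reverse=True)
```

Thousands of IPs collapsing into a handful of groups is the kind of signal the quoted figure describes.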
This is a very cool read, thanks
If you just block the connection, you send a signal that you are blocking it, and they will change it. You need to impose a cost on every connection through QoS buckets.
If they rotate IPs, ban by ASN. Serve a page with some randomized pseudo-looking content in the source (not static), and explain that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).
Have graduated responses starting at a 72 hour ban where every page thereafter regardless of URI results in that page and rate limit. Include a contact email address that is dynamically generated by bucket, and validate all inbound mail that it matches DMARC for Amazon. Be ready to provide a log of abusive IP addresses.
That way, if Amazon wants to take action, they can, but it's in their ballpark. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.
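A per-ASN bucket could look roughly like the token bucket below. This is only a sketch of the rate-limiting part: the class name `AsnRateLimiter` and its parameters are hypothetical, mapping an IP to its ASN is assumed to happen elsewhere, and the graduated bans and the notice page are left out.

```python
import time


class AsnRateLimiter:
    """Token bucket keyed by ASN: each ASN gets `burst` tokens, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock  # injectable for testing
        self._buckets = {}  # asn -> (tokens, last_refill_time)

    def allow(self, asn):
        """Return True if a request from this ASN may proceed normally."""
        now = self.clock()
        tokens, last = self._buckets.get(asn, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[asn] = (tokens - 1, now)
            return True
        # Out of tokens: the caller would serve the rate-limit notice page.
        self._buckets[asn] = (tokens, now)
        return False
```

When `allow` returns False, the idea above is to answer slowly with the explanatory page rather than drop the connection, so the crawler still pays for every request.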
It's possible to ban at finer granularity, specifically CIDR blocks, using the Routeviews project reverse-DNS lookup:
<https://www.routeviews.org/routeviews/>
That also provides the associated AS, enabling blocking at that level as well, if warranted.
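Once you have the CIDR blocks, checking a client against them is easy with the standard library. A small sketch; the networks in `BLOCKED_NETWORKS` are documentation-range placeholders, not real Routeviews output, and the Routeviews lookup itself is not shown.

```python
import ipaddress

# Placeholder blocks (RFC 5737 documentation ranges), standing in for
# whatever prefixes your own lookups turned up.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/25")
]


def blocked_network_for(ip):
    """Return the blocked network containing `ip`, or None if it is not blocked."""
    addr = ipaddress.ip_address(ip)
    for network in BLOCKED_NETWORKS:
        if addr in network:
            return network
    return None
```

A linear scan is fine for a short list; with thousands of prefixes a radix tree or the firewall's own set type would be a better fit.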
Maybe ban ASNs /s
This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]
In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.
[1] https://news.ycombinator.com/item?id=26865236
Why work hard… Train a model to recognize the AI bots!
Because you have to decide in less than 1 ms, and using AI is too slow in that context.
This isn't a problem domain that models are capable of solving.
Ultimately in two party communications, computers are mostly constrained by determinism, and the resulting halting/undecidability problems (in core computer science).
All AI models are really bad at solving stochastic types of problems. They can generally approximate only up to a point, after which performance falls off. Temporal consistency in time-series data is also a major weakness. Throw the two together, and models can't really solve it. They can pattern match to a degree, but that is the limit.
Uggh, web crawlers...
8ish years ago, at the shop I worked at we had a server taken down. It was an image server for vehicles. How did it go down? Well, the crawler in question somehow had access to vehicle image links we had due to our business. Unfortunately, the perfect storm of the image not actually existing (can't remember why, mighta been one of those weird cases where we did a re-inspection without issuing new inspection ID) resulted in them essentially DOSing our condition report image server. Worse, there was a bug in the error handler somehow, such that the server process restarted when this condition happened. This had the -additional- disadvantage of invalidating our 'for .NET 2.0, pretty dang decent' caching implementation...
It comes to mind because I'm pretty sure we started doing some canary techniques just to be safe (ironically, doing some simple ones was still cheaper than even adding a different web server... yes, we also fixed the caching issue... yes, we also added a way to 'scream' if we got too many bad requests on that service.)
When I was writing a crawler for my search engine (now offline), I found almost no crawler library actually compliant with the real world. So I ended up going to a lot of effort to write one that complied with Amazon and Google's rather complicated nested robots files, including respecting the cool off periods as requested.
... And then found their own crawlers can't parse their own manifests.
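For a sense of what basic compliance involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the library described above, and as far as I know the standard parser only covers the simple directives, so it would not be enough for the complicated files mentioned. The sample `ROBOTS_TXT`, the helper names, and the `default_delay` fallback are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent, url, default_delay=1.0):
    """Return seconds to wait before fetching `url`, or None if it is disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)
    # Fall back to a polite default when the site does not request a delay.
    return float(delay) if delay is not None else default_delay
```

A crawler would call `plan_fetch` before every request, skip the URL on None, and otherwise sleep for the returned number of seconds between hits to the same host.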
Could you link the source of your crawler library?
It's about 700 lines of the worst Python ever. You do not want it. I would be too embarrassed to release it, honestly.
It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.