Comment by rnhmjoj

6 months ago

I don't understand: why do people resort to this tool instead of simply blocking by UA string or IP address? Are there so many people running these AI crawlers?

I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
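
For anyone unfamiliar with what that looks like, here is a rough sketch of a UA/IP filter. The CIDR blocks are placeholder documentation ranges (RFC 5737), not real crawler ranges, and the UA tokens and the `is_blocked` helper are illustrative assumptions, not any operator's official list. It only catches crawlers that identify themselves or come from ranges you already know about.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT real crawler ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

# Illustrative UA substrings; check each operator's own docs for the real tokens.
BLOCKED_UA_TOKENS = ("gptbot", "claudebot", "perplexitybot")


def is_blocked(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked network or UA token."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

In practice the same check would normally live in the firewall or web server config rather than in application code.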


mnmalst 6 months ago

Because that solution simply does not work for all. People tried and the crawlers started using proxies with residential IPs.

hooverd 6 months ago

less savory crawlers use residential proxies and are indistinguishable from malware traffic

busterarm 6 months ago

Lots of companies run these kinds of crawlers now as part of their products.

They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.

There are lots of companies around that you can buy this type of proxy service from.

WesolyKubeczek 6 months ago

You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.

rnhmjoj 6 months ago
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163
- nemothekid 6 months ago
  
  OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM proxy when you could simply block on UA?
  I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world.
  
majorchord 6 months ago
> AI companies use residential proxies
Source?
- Macha 6 months ago
  
  Source: Cloudflare
  https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
  Perplexity's defense is that they're not doing it for training/KB-building crawls but for answering dynamic query calls, and that this is apparently better.
  