Comment by WesolyKubeczek
3 days ago
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163
OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
I get the sense many of the bad actors are simply poor copycats that are building LLMs badly and scraping the entire web without a care in the world.
> why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
That's in fact what I was asking: I've only seen traffic from these kinds of companies and I've easily blocked them without an annoying PoW scheme.
I have yet to see any of these bad actors and I'm interested in knowing who they actually are.
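For anyone wondering what "block on UA" amounts to, here is a minimal sketch in Python. The token list and the function name are my own illustration, not anything posted upthread; the tokens are substrings that some crawlers self-report, and the list is neither complete nor guaranteed current. It only catches crawlers that identify themselves honestly, which is exactly the case being described here.

```python
# Illustrative, non-exhaustive list of tokens that self-identifying crawlers
# put in their User-Agent header. This list is an assumption for the sketch;
# check each vendor's published documentation before relying on it.
DECLARED_AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Amazonbot",
    "CCBot",
)


def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the listed tokens.

    Matching is a case-insensitive substring test, so a crawler that sends a
    plain browser User-Agent will not be detected by this check.
    """
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_AI_CRAWLER_TOKENS)
```

A server or reverse proxy would call something like this on each request and refuse the ones that match. Anything hiding behind a regular browser UA passes straight through, which is the scenario the PoW approach is aimed at.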
> AI companies use residential proxies
Source:
Source: Cloudflare
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Perplexity's defense is that they're not doing it for training/KB-building crawls but for answering dynamic queries, and this is apparently better.
I do not see the words "residential" or "proxy" anywhere in that article... or any other text that might imply they are using those things. And personally... I don't trust crimeflare at all. I think they and their MITM-as-a-service have done even more/lasting damage to the global Internet and user privacy in general than all AI/LLMs combined.
However, if this information is accurate... perhaps site owners should allow AI/bot user agents but respond with different content (or maybe a 404?) instead, to try to prevent it from making multiple requests with different UAs.
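A rough sketch of that idea as a WSGI middleware, using only plain Python callables. Everything named here (`BOT_TOKENS`, `hide_from_declared_bots`, `site`) is hypothetical and just for illustration: requests whose User-Agent contains a listed token get a 404, everyone else gets the real page. Whether this actually stops a crawler from retrying with a different UA is an open question, not something this code can show.

```python
# Hypothetical, non-exhaustive token list (lowercase for simple matching).
BOT_TOKENS = ("gptbot", "claudebot", "perplexitybot")


def hide_from_declared_bots(app):
    """Wrap a WSGI app so self-identified bots receive a 404 instead."""

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BOT_TOKENS):
            # Pretend the page does not exist rather than sending a 403,
            # so the response does not advertise that the UA was blocked.
            start_response(
                "404 Not Found",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Not found\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Stand-in for the real site: always answers 200 with a short body."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


application = hide_from_declared_bots(site)
```

Any WSGI server can serve `application` as-is; swapping the 404 branch for a decoy page would be the "different content" variant.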
Well yes it is better. It's a page load triggered by a user for their own processing.
If web security worked a little differently, the requests would likely come from the user's browser.