I've been inadvertently working on this topic and I'd like to share some findings.
* Do not confuse bots with DDoS. While bot traffic may end up overwhelming your server, your DDoS SaaS will not stop that traffic unless you have some kind of bot protection enabled, for example the product described in the post.
* A lot of bots announce themselves via their user agents; some don't.
* If you're running an ecom shop with a lot of product pages, expect a large portion of traffic to be bots and scrapers. In our case it was up to 50%, which was surprising (a rough way to measure this is sketched at the end of this comment).
* Some bots accept cookies, and these skew your product analytics.
* We enabled automatic bot protection and a lot of our third-party integrations ended up being marked as bots and their traffic was blocked. We eventually turned that off.
* (EDIT) Any sophisticated self-implemented bot protection isn't worth the effort for most companies out there. But I have to admit, it's very exciting to think about all the ways to block bots.
What's our current status? We've enabled monitoring to keep a lookout for DDoS attempts, but we're taking the hit on bot traffic. The data on our website isn't really private, except maybe pricing, and we're really unsure how to think about the new AI bots scraping this information. ChatGPT already gives a summary of what our company does. We don't know if that's a good thing or not. Would be happy to hear anyone's thoughts on how to think about this topic.
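If anyone wants a rough sense of that share for their own site, here's a minimal sketch that counts self-announced bot user agents in a combined-format access log. The log path and UA hints are illustrative placeholders, not our actual setup, and it obviously can't see bots that spoof a normal browser UA.

    # Rough bot-share estimate from an Nginx/Apache "combined" access log.
    # The log path and UA hints below are illustrative, not exhaustive.
    import re
    from collections import Counter

    BOT_UA_HINTS = ("bot", "crawler", "spider", "gptbot", "bytespider", "claudebot")

    # In combined log format the User-Agent is the last quoted field on the line.
    ua_re = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

    counts = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = ua_re.search(line)
            if not m:
                continue
            ua = m.group("ua").lower()
            counts["bot" if any(h in ua for h in BOT_UA_HINTS) else "other"] += 1

    total = sum(counts.values()) or 1
    print(f"self-announced bots: {counts['bot'] / total:.1%} of {total} requests")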
> If you're running an ecom shop with a lot of product pages, expect a large portion of traffic to be bots and scrapers.
It's crazy; I registered a new website last month, and every day I get around 200 visitors, for a landing page only! This site is not mentioned or advertised anywhere. The only list where you might find it is in the newly registered domains.
> The only list where you might find it is in the newly registered domains.
No registration anywhere is needed; they'll find you because you have an IP address. I've set up enough machines without any registration, and within a few hours of them getting connected, the usual suspects showed up.
And regarding bots: even if machines don't have e.g. PHP installed, they'll see oodles of attempts to access links ending in *.php. That's where I liked to offer randomly encrypted Linux kernels for them to digest ;-)
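A crude Nginx sketch of the same idea, assuming the server genuinely has no PHP so every *.php hit is a probe (serving a junk payload as described above takes a bit more setup than this):

    # inside your server { } block; assumes nothing legitimate ends in .php
    location ~ \.php$ {
        # cheapest option is to refuse outright; the comment above instead
        # serves large random files to waste the scanner's time and bandwidth
        return 410;
    }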
> This site is not mentioned or advertised anywhere. The only list where you might find it is in the newly registered domains.
Well, that's one place already. Another is the Certificate Transparency logs, which publish every newly issued HTTPS certificate. So "not mentioned anywhere" doesn't quite hold true.
It says "Declare your independence", but your independence is exactly what you stand to lose if you channel your traffic through Cloudflare. You already have your independence; don't give it up to those who appeal to desperation to fool you into believing the opposite of what's true.
We are witnessing the last dying breaths of the open internet. Cloudflare in the middle of all traffic, WebAssembly, etc.
Does Google effectively get a pass, because they (can) use the same bot to index websites for search and to scrape data for AI model training at the same time?
Google does get a pass, since they use Googlebot to scrape content, but then check robots.txt for "Google-Extended" to voluntarily decide whether they can use said content for LLM training[1] (example below the links).
I assume Microsoft intends to do the same, given they have Bing and their recent stance on the matter[2].
[1] https://developers.google.com/search/docs/crawling-indexing/...
[2] https://www.businesstoday.in/technology/news/story/microsoft...
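For reference, the opt-out is just a robots.txt rule against the "Google-Extended" product token; Googlebot still crawls and indexes as usual. Roughly:

    # robots.txt — opt out of Google's AI-training use while staying in Search
    User-agent: Google-Extended
    Disallow: /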
Google does voluntarily allow robots.txt to be configured such that they will still index pages but promise not to use the content for training. But yeah, if Google decided to go rogue, there wouldn't really be anything site owners could do about it without killing their presence in Google's index.
https://searchengineland.com/google-extended-crawler-432636
I find it slightly ironic that they're only able to do this effectively because they've been able to train their own detection model on traffic, mostly from users that have never agreed to anything.
I don't have strong opinions on this either way really, I just found that a bit funny.
There are so many things sites need to protect against these days that it's making independent self-hosting quite annoying. As bots get better at hiding, only companies with huge scale like Cloudflare will be able to identify and block them. DDoS/bot providers are unintentionally creating a monopoly.
Or get an unmetered dedicated server with provided DDoS protection and let it sort itself out.
Cloudflare is running the "email spam protection" play that handed power to Microsoft and Google and made self-hosted email nearly impossible, because email from independently hosted domains would end up getting blocked by Outlook and Gmail.
Unless you're hosting WordPress on a $5 VPS, random bot traffic won't affect your website at all. It's just background radiation.
For those not using Cloudflare but who have access to web server config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that will block most AI bots from scraping content. The configs are built on top of public data sources[1] with various adjustments; a rough sketch of the approach is below the links.
[0] https://github.com/anthmn/ai-bot-blocker
[1] https://darkvisitors.com/
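The Nginx variant boils down to user-agent matching along these lines (a simplified sketch, not the repo's actual file; the UA list there is much longer):

    # http-level include, e.g. /etc/nginx/conf.d/ai-bots.conf (illustrative names)
    map $http_user_agent $is_ai_bot {
        default        0;
        ~*GPTBot       1;
        ~*ClaudeBot    1;
        ~*Bytespider   1;
    }

    # then inside each server { } block:
    if ($is_ai_bot) {
        return 403;
    }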
I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs and most of the 10,000 research papers using our corpus have nothing to do with Machine Learning.
Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...
First off, I want to thank you and the other members of the CC Foundation; the CC data set is an incredible resource for everyone.
Much of the UA data, including CCBot, is from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scraping services. I've now excluded CCBot[1] and the archival services from the recommended configs.
[0] https://darkvisitors.com/agents/ccbot
[1] https://github.com/anthmn/ai-bot-blocker/commit/ae0c2c40fd08...
*by channeling all your traffic through Cloudflare.
Surprise surprise... Bytespider is at the top of the list.
(Being lazy not Googling) What is bytespider and why “surprise surprise”?
This is TikTok's scraper? Why is TikTok doing mass scraping of websites?
ByteDance isn't just TikTok. As mentioned in the article, they have their own LLM product called Doubao.
The most straightforward reason is training data for an LLM.
I don't see the option to enable this on my Pro sites; however, I see it on my free sites.
It'll be so interesting to see what sorts of "biases" future AI models will manifest when they're only trained on a fraction of the web. All any group with an agenda has to do is make their content available for training, with the knowledge/hope that many of those with balancing content will have it blocked. And then there will be increased complaints re said "biases" by the same ones who endorse blocking, without a thought that the issue was amplified by said blocking. And of course use cases for AI will continue to broaden, in most cases without a care for those spouting about "biases". It'll be a wonderful world.