Comment by unsnap_biceps
2 months ago
I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user agents.
That counts as barely imho.
I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
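For anyone hitting the same thing, it ends up looking roughly like this (a sketch, assuming the three crawler names OpenAI documents: GPTBot, ChatGPT-User, and OAI-SearchBot):

    # Wildcard deny-all, which some AI crawlers ignore
    User-agent: *
    Disallow: /

    # Explicit entries for the OpenAI crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /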
Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.
Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.
I also don't think they hit servers nearly as hard or as often.
As I recall, this is outdated information. The Internet Archive does respect robots.txt and will remove a site from its archive based on it. I did this a few years after the blog post you linked, to get an inconsequential site removed from archive.org.
The most recent notice IA has blogged on the topic was in 2017, and there's no indication the service has reversed course on robots.txt since.
<https://blog.archive.org/?s=robots.txt>
This is highly annoying and rude. Is there a complete list of all known bots and crawlers?
https://darkvisitors.com/agents
https://github.com/ai-robots-txt/ai.robots.txt