Comment by unsnap_biceps
2 months ago
I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user agents.
That counts as barely imho.
I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
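For anyone hitting the same thing, it ends up looking roughly like this (a sketch, assuming the three crawler names OpenAI documents: GPTBot, ChatGPT-User, and OAI-SearchBot):

    # Wildcard deny-all, which some AI crawlers ignore
    User-agent: *
    Disallow: /

    # Explicit entries for the OpenAI crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /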
Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.
Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.
I also don't think they hit servers nearly as hard or as often.
As I recall, this is outdated information. The Internet Archive does respect robots.txt and will remove a site from its archive based on it. I did this a few years after the blog post you linked, to get an inconsequential site removed from archive.org.
The most recent notice IA has blogged on the topic was in 2017, and there's no indication the service has reversed course on robots.txt since.
<https://blog.archive.org/?s=robots.txt>
This is highly annoying and rude. Is there a complete list of all known bots and crawlers?
https://darkvisitors.com/agents
https://github.com/ai-robots-txt/ai.robots.txt