Comment by joecool1029

2 days ago

No, archive.org does NOT respect robots.txt. You need to reach out to them directly and ask that your site not be included: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

3 comments

input_sh 2 days ago

Aren't you choosing to ignore something very specific specified in that article? Why do you make it seem as if that article describes their overall policy?

> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).

joecool1029 2 days ago

> Aren't you choosing to ignore something very specific specified in that article?
Of course not. Did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.

LocalH 21 hours ago

I'd rather they disregard robots.txt than have the opposite situation: someone leaves robots.txt off a domain so that IA can archive it, the domain later lapses and gets swooped up by a parker, and the parker then adds a robots.txt blocking IA from the whole site, which would have caused IA to remove all historical archives of that domain from public view.
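
For anyone unsure what "a robots.txt blocking IA from the whole site" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token, the `example.com` URL, and the `is_allowed` helper are assumptions made for illustration; this shows how a robots.txt-honoring crawler would read such a file, not how IA actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a domain parker might publish.
# "ia_archiver" is assumed here to be the token aimed at IA's crawler.
PARKED_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-honoring crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The targeted crawler is shut out of every path on the site...
blocked = is_allowed(PARKED_ROBOTS_TXT, "ia_archiver", "https://example.com/old-post")
# ...while a crawler with no matching rule is still allowed.
other = is_allowed(PARKED_ROBOTS_TXT, "SomeOtherBot", "https://example.com/old-post")
print(blocked, other)
```

The point is that two lines are enough to exclude one crawler from an entire domain, which is why a new owner's robots.txt retroactively hiding old captures would be such a blunt outcome.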