Comment by bombcar
14 hours ago
Don't we need more than an index of Archive.org because whoever controls the domain could robots.txt these out of existence if they wanted to?
It's not about robots.txt but yes, the owners of 538 can just send a cease and desist letter to get them all immediately removed. Many sites that don't want to preserve history have done this already.
Archive.org mostly ignores robots.txt
https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
The robots.txt file should be used to restrict (and, in some cases, slow down) crawling of a site at the time that site is being crawled, not for SEO or for restricting access to mirrors or for any other purpose. It should never apply retroactively. (Unfortunately it is sometimes used badly despite this.)
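To make the crawl-time vs. retroactive distinction concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the URL, and the `example-archiver` user agent are all made up for illustration, and this is not a description of how the Internet Archive actually implements anything.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents at two points in time.
ROBOTS_AT_CRAWL_TIME = "User-agent: *\nDisallow: /private/\n"
ROBOTS_ADDED_LATER = "User-agent: *\nDisallow: /\n"

URL = "https://example.com/articles/1"
AGENT = "example-archiver"  # made-up user agent

# The page was allowed when it was crawled, so a snapshot exists.
captured = allowed(ROBOTS_AT_CRAWL_TIME, AGENT, URL)

# Crawl-time policy: only the robots.txt in force at capture time matters.
still_shown_crawl_time = captured

# Retroactive policy: a robots.txt added later also hides old snapshots.
still_shown_retroactive = captured and allowed(ROBOTS_ADDED_LATER, AGENT, URL)

print(captured, still_shown_crawl_time, still_shown_retroactive)
```

Under the crawl-time reading the old snapshot stays visible; under the retroactive reading, a new owner's blanket `Disallow: /` hides it.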
People always use that link as a reference to say that Internet Archive ignores robots.txt, but it only actually says they are ignoring it for government sites. It suggests that they might do it for other sites in the future (as of 2017), but does not actually say that they have done it.
https://blog.archive.org/2018/04/24/addressing-recent-claims... which is from a year later, mentions that they have an automated process that still follows robots.txt when displaying old pages, even where the robots.txt was added later.
https://help.archive.org/help/using-the-wayback-machine/ does say they follow it for scraping, but this is phrased in a way that would still be true for past sites whether or not they changed the policy. There is a page https://www.sysjolt.com/2021/archive-org-no-longer-honors-ro... which claims they don't follow it, but the site owner misspelled "robots" as "robot".
That first link is confusing; it seems to say they ended up removing the pages not because of a legal threat but because of an “automated” robots.txt process.
If archive.org can be manipulated into removing content, either via legal threats or a simple robots.txt, it loses a significant portion of its societal value.