Comment by nl
15 hours ago
This is because whoever owns FiveThirtyEight now (ABC?) deleted the site's whole archive of articles.
Don't we need more than an index of Archive.org, because whoever controls the domain could robots.txt these out of existence if they wanted to?
It's not about robots.txt but yes, the owners of 538 can just send a cease and desist letter to get them all immediately removed. Many sites that don't want to preserve history have done this already.
Archive.org mostly ignores robots.txt
https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
The robots.txt file should be used to restrict (and, in some cases, slow down) crawling at the time it is being crawled, not for SEO or for restricting access to mirrors or for any other purpose. It should never apply retroactively. (Unfortunately it is sometimes used badly despite this.)
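To make the "at crawl time" point concrete, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration; this is not how Archive.org's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the crawler has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The check happens when the crawler is about to fetch a URL.
# It says nothing about copies that were fetched in the past.
allowed = parser.can_fetch("ExampleBot", "https://example.com/articles/1")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/draft")

# The "slow down" case: how long the site asks crawlers to wait between requests.
delay = parser.crawl_delay("ExampleBot")

print(allowed, blocked, delay)
```

Nothing in this check looks at an existing archive, which is why applying a later robots.txt retroactively is a policy choice by the archive rather than something the file format asks for.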
People always cite that link to say that the Internet Archive ignores robots.txt, but it only actually says they are ignoring it for government sites. It suggests that they might do it for other sites in the future (as of 2017), but does not actually say that they have done it.
https://blog.archive.org/2018/04/24/addressing-recent-claims... (from a year later) mentions that they have an automated process that still follows robots.txt for displaying old pages where the robots.txt was added later.
https://help.archive.org/help/using-the-wayback-machine/ does say they follow it for scraping, but this is phrased in such a way that it would still be true for past sites whether or not they changed the policy. There is a page https://www.sysjolt.com/2021/archive-org-no-longer-honors-ro... which claims they don't follow it, but the site owner misspelled "robots" as "robot".
[flagged]
Please, say that again in comprehensible English.
The ownership relationship was always load-bearing? The journalism in this case was a tenant, I highly recommend that people promote forms of independent journalism?
EDIT: dude have you heard of the s in https, http://johntantalo.com gets flagged.