Comment by danso

3 years ago

Tangential question: when the Wayback Machine retroactively excludes a site's content (e.g. a site owner adds a robots.txt that specifies exclusion), the data isn't deleted, right? It's just flagged so it can't be found in search again. In other words, if the exclusion of KF turns out to be unnecessary or unwanted (or if researchers want to study the KF data in the future), restoring it is just a matter of flipping the flag, right?
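
For reference, the opt-out mechanism in question is just a couple of lines in the site's robots.txt. A minimal sketch (ia_archiver is the crawler user-agent the Wayback Machine has historically honored; whether a given site also blocks other crawlers is up to that site):

    # robots.txt at the site root. Historically, the Wayback Machine
    # honored this by hiding (not deleting) the site's snapshots.
    User-agent: ia_archiver
    Disallow: /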

Here's an example of a site that was retroactively excluded at the webmaster's request, but later (under compulsion?) remained in the searchable archive:

http://blog.archive.org/2018/04/24/addressing-recent-claims-...

edit: to answer my own question, it seems that retroactive exclusion has, at least since 2007, not been interpreted as a mandate for actual data removal:

https://archive.org/post/133690/robotstxt-only-gives-tempora...

> Hello, I want all old content from immortal ia.com REMOVED permanently from The Wayback Machine. So I read the exclusion policy, placed up a robots.txt file and requested the Alexa bot go to my website. Then checking The Wayback Machine, I got a notice that the site was blocked by the robots.txt file. But after I removed the robots.txt file, the archived pages reappeared. Is there a way to permanently purge all old pages of a website so that they will NEVER reappear in The Wayback Machine? Am I obligated to keep the robots.txt file in place forever?
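
That "reappearing" behavior can be observed from the outside via the Wayback Machine's public availability endpoint, which reports whether a snapshot is currently being served; an empty result for an excluded site means "hidden right now," not necessarily "deleted." A minimal sketch in Python (the endpoint is real; the helper name and the hidden-vs-deleted interpretation are my own):

    import json
    import urllib.parse
    import urllib.request

    def wayback_currently_serves(url: str) -> bool:
        """Return True if the Wayback Machine currently serves a snapshot of url."""
        endpoint = ("https://archive.org/wayback/available?url="
                    + urllib.parse.quote(url, safe=""))
        with urllib.request.urlopen(endpoint) as resp:
            data = json.load(resp)
        # For an excluded site, "archived_snapshots" comes back empty even
        # though the underlying data may still be preserved behind the flag.
        return bool(data.get("archived_snapshots"))

    print(wayback_currently_serves("example.com"))  # True if a snapshot is served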

I've worked with people at the Internet Archive before. The data is never deleted. Blocking happens, but these folks take data preservation extremely seriously, and the chance that the data is actually gone in any way is exactly 0%.

Even spam stays in.

It's probably there, inaccessible, waiting for future researchers in a more civilized age.