Comment by ctippett

2 days ago

Am I correct that this has come about because archive.org respects robots.txt, and these sites have blocked its crawler from indexing them?

I'm not sure how to articulate my thoughts on this exactly, other than to say it's disappointing that doing the right thing (i.e. respecting robots.txt) is rewarded with the burden of soliciting responses to a petition, while others are rewarded with profit for ignoring those same directives.

Don't know if it helps your musings at all, but there's a good chance that if a high-profile crawler like archive.org disrespected those sites' robots.txt, it would face lawsuits (or some other form of pressure). Respecting robots.txt is not merely the most moral move; it's the only sensible one.

The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating.

  • >pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating

    I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flout the rules.

    "Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...

    • Fair point. Being small and shadowy is a sufficient condition to avoid litigation, but not a necessary one. Another sufficient condition is having billions of dollars to throw around. Unfortunately, archive.org is well known, well loved, and fundamentally harmless.

      1 reply →

    • But AI companies don’t publicly redistribute the content they scrape, whereas Internet Archive does.

      Even if you believe what the AI companies are doing is or should be a copyright violation, the Internet Archive is redistributing in a more direct manner.

Correct. Example snippet from the nytimes.com robots.txt:

    User-agent: archive.org_bot
    Disallow: /
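
Anyone who wants to check a rule like this programmatically can do so with Python's standard-library urllib.robotparser. A minimal sketch, parsing the snippet above locally (the live nytimes.com file may differ):

    from urllib import robotparser

    # Parse the rules quoted above rather than fetching the live file.
    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: archive.org_bot",
        "Disallow: /",
    ])

    # archive.org_bot is blocked from every path on the site...
    print(rp.can_fetch("archive.org_bot", "https://www.nytimes.com/section/world"))  # False
    # ...while a user-agent with no matching rule is unaffected by this snippet.
    print(rp.can_fetch("SomeOtherBot", "https://www.nytimes.com/section/world"))     # True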

  • Which they don’t respect. I’ve had it for my blog for years and they still added it to the Wayback Machine; see my last comment for their official announcement of the ignore-robots.txt policy. It is not new.

    • robots.txt means they shouldn't auto-scan your site. Any user, though, can go to the Wayback Machine and type in a URL, and the Wayback Machine will read that URL. That was the intent of robots.txt: "don't scan", not "don't read, period". It's spelled out in the robots.txt spec.
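
      A rough sketch of that scan-versus-read distinction, using Python's standard-library urllib.robotparser; this is not Internet Archive's actual logic, and the should_fetch helper and user_initiated flag are made up for illustration:

          from urllib import robotparser

          def should_fetch(url, robots_url, user_agent, user_initiated):
              # A one-off retrieval requested by a person is treated as
              # "reading", not scanning, so robots.txt is not consulted.
              if user_initiated:
                  return True
              # Automated scanning checks robots.txt first.
              rp = robotparser.RobotFileParser()
              rp.set_url(robots_url)
              rp.read()  # fetch and parse the site's robots.txt
              return rp.can_fetch(user_agent, url)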

      2 replies →

    • > I’ve had it for my blog for years

      Just out of curiosity, why don't you want your public blog archived? Not questioning your choice, just trying to understand the logic/motivation.

      Also, I think you're being unfairly downvoted.

No, archive.org does NOT respect robots.txt. You need to reach out to them directly and ask that your site not be included: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

  • Aren't you choosing to ignore something very specific in that article? Why make it seem the article implies this is their overall policy?

    > A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).

    • > Aren't you choosing to ignore something very specific specified in that article?

      Of course not, did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”

      The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.

  • I'd rather they disregard robots.txt than the opposite situation: someone leaves robots.txt off a domain so the IA can archive it, then for whatever reason the domain lapses and gets scooped up by a parker who adds a robots.txt blocking the IA from the whole site, which would have caused the IA to remove all historical archives of that domain from public view.

It's because they want to stop AI companies from stealing their content, but they can't do that if the Internet Archive proxies it all for them.

All of the LLMs would be massively less useful if it weren't for scraping the latest news.

  • LLMs have other ways of accessing the content, they don’t need the Web Archive.

    Every LLM company can afford to spin up a new subscriber account every day, proxy through different IPs from all sorts of ASNs, crawl until the account gets banned, and then do it again, and again, and again.

    • > LLMs have other ways of accessing the content, they don’t need the Web Archive.

      What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.

      Locking a door (or publishing a robots.txt) is how one establishes mens rea for those who bypass the barrier.

      2 replies →

  • LLM companies would then license content from news orgs and other publishers, which is what should happen.

  • "Stealing" is BS because the original still exists. "Copyright infringement" is more correct.

    • You can call it whatever you want, but it's killing journalism when LLMs can automatically scrape and reword all the news, sucking up the profits without contributing anything back to the people who created the work.

      3 replies →