Comment by logicprog

8 days ago

> That, or, they could just respect robots.txt

IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes the site down for others, because these are non-rivalrous resources that are literally already public.

> we could put enforcement penalties for not respecting the web service's request to not be crawled... We need laws.

How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.

> IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes the site down for others, because these are non-rivalrous resources that are literally already public.

I disagree. Whether or not content should be available to be crawled depends on the content's license and on what the site owner specifies in robots.txt (or, in the case of user-submitted content, whatever the site's ToS allows).

It should be wholly possible to publish a site intended for human consumption only.

> How would that be enforceable?

Making robots.txt or something like it a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with their logs, with legal action taken against the violators.

  • > It should be wholly possible to publish a site intended for human consumption only.

    You have just described the rationale behind DRM. If you think DRM is a net positive for society, I won't stop you, but plenty has been published online about the anguish, pain, and suffering it has wrought.

    • Precisely. This would be a system essentially designed to ensure, by means of legislation, that your content can only be accessed by the specific kinds of users you approve of, for the specific uses you approve of, and only with the clients and software you approve of. That way you don't have to go through the hassle of actually setting up the (user-hostile) technologies that would otherwise be necessary to enforce this, and/or give up the appearance of an open web by requiring sign-ins, while just being hostile on another level. It's trying to have your cake and eat it too, and it will only massively strengthen the entire ecosystem of DRM and IP. I also just personally find the idea of posting something on a board in a town square and then trying to decide who gets to look at it ethically repugnant.

      This is actually kind of why I like Anubis. Instead of trying to dictate which clients, purposes, or types of users can access a site, it just shifts the asymmetry of costs enough that, hopefully, the problem goes away. You can still scrape a site behind Anubis; it just takes a bit more commitment, so doing it at an individual level stays easy while doing it at a mass, DoS-like level gets expensive.
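
      (Illustration only, not Anubis's actual protocol: Anubis-style challenges are essentially a small proof of work done in the browser. Below is a minimal Python sketch of that cost asymmetry; the difficulty value and function names are made up for the example.)

      ```python
      import hashlib
      import secrets

      # Minimal proof-of-work sketch: the server hands out a random challenge,
      # and the client must find a nonce whose SHA-256 hash starts with enough
      # zero hex digits. Solving takes many hash attempts; verifying takes one.

      DIFFICULTY = 4  # hypothetical tuning knob: leading zero hex digits required

      def make_challenge() -> str:
          return secrets.token_hex(16)

      def solve(challenge: str, difficulty: int = DIFFICULTY) -> int:
          nonce = 0
          while True:
              digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
              if digest.startswith("0" * difficulty):
                  return nonce
              nonce += 1

      def verify(challenge: str, nonce: int, difficulty: int = DIFFICULTY) -> bool:
          digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
          return digest.startswith("0" * difficulty)

      challenge = make_challenge()
      nonce = solve(challenge)         # costly for whoever requests the page
      assert verify(challenge, nonce)  # nearly free for the server
      ```

      A human visitor pays the solve cost once and barely notices; a scraper hammering millions of pages pays it millions of times, which is the asymmetry described above.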

> unless that crawl is unreasonably expensive or takes it down for others

This _is_ the problem Anubis is intended to solve -- forges like Codeberg or Forgejo, where many routes perform expensive Git operations (e.g. git blame), and scrapers do not respect the robots.txt asking them not to hit those routes.
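
To make concrete what "respecting robots.txt" would look like on the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser; the host, user agent, and blame route are placeholders invented for the example, not Codeberg's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Sketch of a well-behaved crawler checking robots.txt before hitting an
# expensive route. Host, user agent, and path are made up for illustration.
rp = RobotFileParser("https://git.example.org/robots.txt")
rp.read()  # fetch and parse robots.txt once per host

url = "https://git.example.org/someuser/somerepo/blame/branch/main/README.md"
if rp.can_fetch("ExampleBot/1.0", url):
    print("allowed -> fetch", url)
else:
    print("disallowed -> skip", url)  # the scrapers in question ignore this answer
```

The complaint in this thread is precisely that the check above is voluntary: nothing stops a scraper from skipping it and hitting the expensive routes anyway.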