Comment by thewebguyd
8 days ago
> fund Common Crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it
That, or they could just respect robots.txt, and we could impose enforcement penalties for not respecting a web service's request not to be crawled. Granted, we probably need a new standard, but all these AI companies are just shitting all over the web, being disrespectful of site owners, because who's going to stop them? We need laws.
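To be clear, the request itself is already machine-readable today; the problem is purely that honoring it is voluntary. Something like this already says exactly what I mean (GPTBot and CCBot are real crawler user-agent tokens; the paths are just made-up examples):

```
# Ask specific AI crawlers to stay away entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl, except this example path
User-agent: *
Disallow: /api/
```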
> That, or they could just respect robots.txt
IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes the site down for others, because these are non-rivalrous resources that are literally already public.
> we could impose enforcement penalties for not respecting a web service's request not to be crawled... We need laws.
How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.
> IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes the site down for others, because these are non-rivalrous resources that are literally already public.
I disagree. Whether or not content should be available to be crawled depends on the content's license and on what the site owner specifies in robots.txt (or, for user-submitted content, whatever the site's ToS allows).
It should be wholly possible to publish a site intended for human consumption only.
> How would that be enforceable?
Making robots.txt (or something else) a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with logs, and take legal action against the violators.
> It should be wholly possible to publish a site intended for human consumption only.
You have just described the rationale behind DRM. If you think DRM is a net positive for society, I won't stop you, but plenty has been published online about the anguish, pain, and suffering it has wrought.
> unless that crawl is unreasonably expensive or takes the site down for others
This _is_ the problem Anubis is intended to solve: forges like Codeberg or other Forgejo instances, where many routes perform expensive Git operations (e.g. `git blame`), and scrapers do not respect the robots.txt asking them not to hit those routes.
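For anyone unfamiliar: Anubis puts a proof-of-work challenge in front of those routes, so each visitor's browser has to burn a little CPU finding a hash that meets a difficulty target before the server does any expensive work. A toy sketch of the idea in Python (not Anubis's actual code, just the underlying mechanism):

```python
import hashlib
import os

def meets_target(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """True if sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def solve(challenge: bytes, difficulty: int) -> int:
    """What the visitor's browser does: brute-force a nonce. Cheap once per
    human visitor, ruinously expensive for a scraper making millions of requests."""
    nonce = 0
    while not meets_target(challenge, nonce, difficulty):
        nonce += 1
    return nonce

# Server side: issue a random challenge, then verify the returned nonce in O(1).
challenge = os.urandom(16)
nonce = solve(challenge, difficulty=4)
assert meets_target(challenge, nonce, difficulty=4)
print(f"solved with nonce {nonce}")
```

The asymmetry is the whole point: verification is one hash, solving is tens of thousands, so the cost lands on whoever is making the requests.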
Laws are inherently national; the internet is not. By all means, write a law that crawlers must obey robots.txt, but how are you going to make Russia or China follow it?