
Comment by diggan

8 hours ago

How do you know that that bot is part of those AI companies? Maybe it's my personal bot you're blocking, should I also not have (indirectly) access to the content?

No. Access to my content is a privilege I grant you. I decide how you get to access it, and via a bot that my setup confuses for an AI crawler belonging to an anti-human AI corporation is not a valid way to access it. Get off my virtual lawn.

  • > No. Access to my content is a privilege I grant you.

    Right, I thought the conversation was about public websites on the public internet, but I think you're talking about this in the context of a private website now? I understand keeping tighter controls if you're dealing with private content you want accessible via the internet for others but not the public.

    • All websites are private (excepting maybe government sites). In most places the internet infrastructure itself is private.

      You're conflating this with a legal concept that applies to spaces that are shared, government-owned, and paid for by taxes, and that the government feels people should be able to access.

      The web is closer to a shopping mall. You're on one person's property to access the stuff of other people who pay to be there. They set their own rules. If you don't follow those rules, you get kicked out, charged with trespassing, and possibly banned from the mall entirely.

      AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.

    • You’re literally visiting a service paid for by me. It’s open to the public, but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave. You have no innate right to access the content I share.

      Blocking misbehaving IP addresses isn’t new, and is another version of the same principle.

    • This interpretation won't take you that far.

      Crawling prevention is not new. Many news outlets and biggish websites have been preventing access by non-human agents in various ways for a very long time.

      Now non-human agents have improved and started to leech everything they can find, so the methods are evolving, too.

      News outlets are also public sites on the public internet.

      Source-available code repositories are also on the public internet, but said agents crawl and use that code, too, backed by fair-use claims.

You can use an honest user-agent string denoting that it's your bot. Some AI companies label their bots transparently; they show up in the logs I keep.
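For anyone wondering what an honest user-agent string might look like in practice, here is a minimal sketch using only Python's standard library. The bot name, version, and contact URL are hypothetical placeholders, and the `name/version (+contact-url)` layout is just a common convention, not something any site is obliged to accept.

```python
import urllib.request

# Hypothetical identity for a personal bot; replace with your own details.
BOT_NAME = "example-personal-bot"
BOT_VERSION = "0.1"
CONTACT_URL = "https://example.com/bot-info"


def honest_user_agent(name: str, version: str, contact: str) -> str:
    """Build a User-Agent string that names the bot and points to its operator."""
    return f"{name}/{version} (+{contact})"


def build_request(url: str) -> urllib.request.Request:
    """Prepare a request that identifies itself instead of imitating a browser."""
    headers = {"User-Agent": honest_user_agent(BOT_NAME, BOT_VERSION, CONTACT_URL)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Only builds the request and prints the header; no network traffic is sent.
    req = build_request("https://example.com/")
    print(req.get_header("User-agent"))
```

A string like this lets the site owner see in their logs whose bot it is and decide for themselves whether to allow it.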

While I understand that you may need a personal bot to crawl or mirror a site, I can't guarantee that I'll grant you access.

I don't like to be that heavy-handed in the first place, but capitalism is making it harder to trust entities that you can't see and talk to face to face.