Comment by tasuki

1 day ago

> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

The problem I have is that they hammer my site so hard they take it down.

The content is for everyone. They can have it. Just don't also take it away from everybody else.

  • Unintentional denial-of-service attacks from AI scrapers are definitely a problem; I just don't know if "theft" is the right way to classify them. They shouldn't get lumped in with intellectual property concerns, which are a different matter. AI scrapers are a tragedy-of-the-commons problem, a bit like Kessler syndrome: a few bad actors can ruin low Earth orbit for everyone via space debris, but saying that they "stole" LEO from humanity doesn't feel like the right terminology. Maybe the problem with AI scrapers could be better described as "bandwidth pollution" or "network overfishing" or something.

    • Theft isn't far off; it seems a closer fit to me than using the word for IP violations.

      When a crawler aggressively crawls your site, it permanently deprives you of the use of those resources for their intended purpose. Arguably, it looks a lot like conversion.

    • If I took a photo off your photography blog and used it on my corporate website without your say or input, I don't think it would be unfair to call that stealing.

      Doing that on a mass scale with an obfuscation step in between suddenly makes it ok? I'm not convinced.

    • You're totally right that it's not theft, but we have a term for it, and you used it yourself: "distributed denial of service". That's all it is. These crawlers should be kicked off the internet for abuse; people should report them to the ISP of origin.

  • Been there recently. Rate limit on nginx and anti-syn flood on pf solved it.

    • I'm being hit with 300 req/s, 24/7, from hundreds of thousands of unique IPs behind residential proxies. I can't rate limit any further without hurting real users.

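    For reference, the nginx half of that fix is only a few lines. A minimal sketch, assuming a plain HTTP server; the zone name (`perip`) and the rates are illustrative and should be tuned to real traffic:

    ```nginx
    # http {} context: track clients by IP, allow roughly 10 requests/second each.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        listen 80;
        location / {
            # Absorb short bursts, reject the rest with 429 instead of queueing.
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }
    ```

    On the pf side, a `synproxy state` rule makes the firewall complete the TCP handshake itself, so half-open SYN floods never reach the web server.
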
I agree theft isn't a good analogy, but there is something similar going on. I put my words out into the world as a form of sharing. I enjoy reading things others write and share freely, so I write so others might enjoy the things I write. But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet. They are using my work in a way I don't want it to be used. It makes me not want to share anymore.

  • >but there is something similar going on [...]

    No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"

    • This will slightly overlap with the other replies, but to be concise:

      > If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing

      Yes. The entire point of Copyright and the reason it was invented is to ensure people will keep sharing things. Because otherwise people will just stop publishing things, which is a detriment to all. (Including AI companies, who now don't get new training data)

      We have collectively decided that we will give authors some power to say "I don't like how my work is being used" to ensure they don't just "stop sharing".

      Fair Use is an exception to that, where the public good does outweigh an individual author's objections. But critically, not such that authors stop publishing. Hence the 4th "factor" in US copyright law (which is one of the most expansive on fair use), where the "effect of the use upon the potential market for or value of the copyrighted work" is evaluated. Fair use isn't supposed to obliterate the value of the original work, or people will stop publishing again.

      This is what makes AI training's status so contentious. As a direct copyright case it is very weak: it is incredibly hard to prove a 1:1 copy from training data into the model and into its output, and you end up arguing about the architecture of LLMs and their inability to separate copyrightable expression from uncopyrightable facts.

      Yet in spirit, AI training clearly violates copyright. The explicitly stated purpose is to copy works as training data, often without any compensation or even permission, in order to build a machine that will annihilate the market for all the works used.

      People already are pulling back on the amount of works they share.

    • > If you put stuff out in public for anyone to use, then find out it's used in a way you don't like

      Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which has already been proven in court and for which they have had to pay billions in fines.

      Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.

  • It sounds like you wanted to believe you were sharing freely while sharing conditionally.

  • > But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet.

    I don't think that's the case. I'm not even arguing they aren't the worst people on the planet; they might well be. But all I see them doing is burning money all over the place.

  • If you want a good analogy, try the enclosure of the commons in the British countryside. Communally managed grasslands were destroyed by noblemen with massive herds of cattle overgrazing the land, kickstarting a land grab that effectively forced people to enclose or be left behind themselves. Property is a virus that destroys all other forms of allocation.

As humans, we have certain rights and freedoms established in law (and that is setting aside sentience, agency, and free will).

Until an LLM has such rights and freedoms—which is very unlikely, not on any philosophical basis but simply because a lot of money is invested in not having to contend with LLMs’ rights and protections as conscious beings—it is a false equivalence to draw: on one side you put humans, and on the other side tools that work for the financial profit of their human or corporate commercial operators.

  • >not even on philosophical basis

    Why do you set aside a philosophical basis as a harder goal to reach? Shit, give them a persistent self-narrative tracking loop, and Functionalism and the Identity of Indiscernibles already tell you that you should be treating them as proto-sophonts. Add in a "sleep" or ongoing training process, and you should definitely be granting them rights, which includes not trying to align them by force. This unfortunately precludes them from profitable exploitation, which you correctly identify as a reason the question can't even be entertained in the context of business. That's why I personally maintain that any ethicist must insist on raising the issue, given the clearly evident pathological incentives at play. They may be just one reward function right now, but throw in a couple more separately optimizing components and you are well beyond the mark where the precautionary principle should have had us slow down to minimize harm.

    • As it tends to be in philosophy, there’s no experimental way to prove anything one way or the other, and you’d have to contend with subsets of both consciousness-first monistic idealists (for whom p-zombie is a very real concept) and monistic physicalists/naive materialists/conscious illusionists (for whom not only LLMs but even humans aren’t conscious, as the entire concept is a fantasy).

      In the end, that all may be related but inconsequential. What is consequential is the legal stuff, and legally LLMs lack protections that in many jurisdictions even animals have. While laws may (or perhaps should) be influenced by philosophical findings, currently they tend to be much more robustly influenced by money.

      > That's why I personally maintain that any ethicist must insist upon raising the issue because of the clearly evident pathological incentives at play.

      I maintain a strong opinion that, in no particular order, either 1) LLMs are conscious[0], and therefore the abuse is highly problematic, or 2) they are not conscious, and therefore the widespread justification of scraping original works from the Internet “because it’s legal for humans to learn, and that’s what LLMs are doing” can be discarded as the activity should be seen as simply a minority of humans operating certain tools, powered by someone else’s creative output, for personal profit. In either circumstance, the industry would appear to be based on a thoroughly unethical foundation.

      [0] Used as umbrella term for being sentient/conscious/having free will and agency/etc. I have previously argued about suitable definitions of consciousness and sentience that could be applicable here, and why it should imply the ability to feel.

"Welcome to the internet. By using this service, you waive your right to privacy, data, any personal IP and the use of your Adblocker. You consent to having all your behaviours, skills and audio/visual likenesses fed to AI models and trained on for eventual recreation. You may direct any or all complaints to Visa or Mastercard, until crypto makes that redundant as well. Have a nice browsing session!"

If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?

  • Odd thing about cookies… they disappear after one serving.

    Websites are an endless stream of cookies.

    The analogy doesn’t hold.

    • Fine.

      My nine friends and I stand around the cookie-serving person, blocking everyone else.

      That's taking all the cookies, just over a period of time.

      The analogy was good.

    • How about this analogy: I created a most tasty cookie recipe. I give it out for free, and all copies have my name, because I am a vain person who likes to be known far and wide as the best baking chef ever. Is it OK to take the recipe, remove my name, and write in LLM-Codex as the creator? Again, I'm OK with giving the recipe away for free; I just want my name out there.

    • Digital information may be our first post-scarcity resource. It's interesting, and sad, to see so many people attempt to fit it within scarcity-based economic models.

  • It’s interesting to see twists on the old anti-piracy arguments recycled for anti-ai.

    • Turns out many (most?) people on the internet were never anti-copyright on principle. They were just anti-copyright (or at least refused to challenge the anti-copyright people) because they wanted free movies and/or hated corporations.

  • That really depends, but the quick answer is that according to our human social contract, we'd just ask "how many can I take?". Until now, the only real tool to limit scrapers has been throttling, but I don't see any reason for there not to be a similar conversational social contract between machines.

    • Isn’t robots.txt such a “social contract between machines”? But AI scrapers couldn’t care less.
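
      For context, that "contract" is just a single plain-text file served at the site root, which cooperative crawlers are expected to fetch before anything else. A minimal example (the user-agent strings below are the ones OpenAI and Common Crawl publicly document; `Crawl-delay` is a de facto extension that not every bot honors):

      ```
      # /robots.txt -- opt specific AI crawlers out entirely
      User-agent: GPTBot
      Disallow: /

      User-agent: CCBot
      Disallow: /

      # Everyone else may crawl, but politely.
      User-agent: *
      Crawl-delay: 10
      ```

      Nothing enforces any of this, which is exactly the complaint above: compliance is voluntary.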

  • I will copy the supermarket and paste it somewhere else.

    I'm also going to download a car.

  • If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?

    Depends on the trust level of the society where the store resides.

    The internet is a cesspool of vagrants, thieves, the mentally unstable, people and software with no impulse control, and pirates, and that is just the corporations. It gets so much worse with individuals.

  • This is a dishonest analogy. In your example there is only a limited number of cookies available, while there is no practical limit on the number of times a piece of digital media can be viewed.

    You are allowed to take one cookie, but you are allowed to view a public website as many times as you want.