Comment by kouteiheika

2 days ago

Yes, people are now very pro-IP because it's the big corporations that are pirating stuff and harvesting data en masse to train their models, and not just some random teenagers in their basements grabbing an mp3 off LimeWire. So now the IP laws, instead of being draconian, are suddenly not adequate.

But what is frustrating to me is that the second-order effects of making the law more restrictive will do us all a big disservice. It will not stop this technology; it will just make it more inaccessible to normal people and put more power into the hands of the big corporations which the "they're stealing our data!" people would like to stop.

Right now I (a random nobody) can go on HuggingFace, download a model which is more powerful than anything that was available 6 months ago, and run it locally on my machine, unrestricted and private.
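
Concretely, with the transformers library that's a few lines of Python (the model name is only an illustrative open-weights example; any open model on the Hub works the same way):

    # A minimal sketch: downloads the open weights once, then runs fully
    # local and offline. Swap in any open-weights model id.
    from transformers import pipeline

    generate = pipeline("text-generation", model="gpt2")
    print(generate("Open models let anyone", max_new_tokens=20)[0]["generated_text"])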

Can we agree that's, in general, a good thing?

So now if you make the model creators liable for misuse of the models, or make the models a derivative work of their training data, or anything along these lines - what do you think will happen? Yep. The model on HuggingFace is gone, and the only thing you'll have access to is a paywalled, heavily filtered and censored version of it provided by a megacorporation, while the megacorporation itself keeps unlimited, unfiltered internal access to that model.

>Can we agree that's, in general, a good thing?

The models come from overt piracy, and are often used to make fake news, slander people, or produce other illegal content. Sure, it can be funny, but the fruit of a poisonous tree is always going to be overt piracy.

I agree research is exempt from copyright, but people cashing in on unpaid artists' work for commercial purposes is a copyright violation predating the DMCA/RIAA.

We must admit these models require piracy, and can never be seen as ethical. =3

'"Generative AI" is not what you think it is'

https://www.youtube.com/watch?v=ERiXDhLHxmo

  • > are often used to make fake news, slander people, or other illegal content.

    That's not how these models are used in the vast majority of cases.

    This argument is like saying "kitchen knives are often used to kill people so we need to ban the sale of kitchen knives". Do some people use kitchen knives to kill? Sure. Does it mean they should be banned because of that?

    > I agree research is exempt from copyright, but people cashing in on unpaid artists' work for commercial purposes is a copyright violation predating the DMCA/RIAA. We must admit these models require piracy, and can never be seen as ethical. =3

    So, may I ask - where exactly do you draw the line? For the sake of argument, let's imagine something like this:

        1. I scrape the whole internet onto my disk.
        2. I go through the text, and gather every word bigram, and build a frequency table.
        3. I delete everything I scraped.
        4. I use that frequency table (which, compared to the exabytes of the source text I used to build it, is a couple hundred megabytes at most) to build a text generator.
        5. I profit from this text generator.
    

    Would you consider this unethical too? Because this is essentially how LLMs work, just in a slightly fancier way (see the toy sketch below). On what exact basis do you draw the line between "ethical" and "unethical" here?
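
    A minimal sketch of that pipeline in Python, purely for the sake of the argument (the toy corpus stands in for the scraped text):

        import random
        from collections import Counter, defaultdict

        # Step 2: walk the text and build a bigram frequency table.
        corpus = "the cat sat on the mat and the cat ran".split()  # stand-in for the scrape
        freq = defaultdict(Counter)
        for a, b in zip(corpus, corpus[1:]):
            freq[a][b] += 1

        # Step 3: the scraped text is discarded; only the table remains.
        del corpus

        # Step 4: generate text by sampling successors from the table.
        def generate(word, length=8):
            out = [word]
            while word in freq and len(out) < length:
                nexts = freq[word]
                word = random.choices(list(nexts), weights=list(nexts.values()))[0]
                out.append(word)
            return " ".join(out)

        print(generate("the"))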

    • > 1. I scrape the whole internet onto my disk.

      This is illegal under theft-of-service laws, and a violation of most sites' terms of service. If these spider scrapers had respected the robots exclusion standard under its intended search-engine use case, then getting successfully sued for overt copyright piracy and quietly settling for billions would seem unfair.
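
      For reference, honoring the robots exclusion standard takes a few lines with Python's standard library (the URL and user-agent string are illustrative):

          import urllib.robotparser

          # Fetch and parse robots.txt, then ask permission before crawling --
          # the intended use case of the exclusion standard.
          rp = urllib.robotparser.RobotFileParser()
          rp.set_url("https://example.com/robots.txt")
          rp.read()
          print(rp.can_fetch("ExampleScraperBot/1.0", "https://example.com/some/page"))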

      Note too that currently >52% of the web is LLM-generated slop, so any model trained on that output will inherit similar problems.

      > 2. I go through the text, and gather every word bigram, and build a frequency table.

      And when (not if) a copyrighted work is plagiarized without citation, it is academic misconduct, IP theft, and an artistic counterfeit. Copyright law is odd, and often doesn't make a distinction about the origin of similar works. Note that this part of the law was extended to private individuals just this year:

      "OpenAI Stole Scarlet Johansson's Voice"

      https://www.youtube.com/watch?v=YhgYMH6n004

      > 3. I delete everything I scraped.

      This doesn't matter if the output violates copyright. Images in JPEG format are compressed in the frequency domain, have been around for ages, and still get people sued or stuck in jail regularly.
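
      To illustrate the point about representation (a JPEG-style DCT round trip, assuming NumPy and SciPy; the 8x8 block stands in for real image data):

          import numpy as np
          from scipy.fft import dctn, idctn

          # Move a pixel block into the frequency domain and back: the work
          # survives the change of representation intact.
          block = np.arange(64, dtype=float).reshape(8, 8)  # stand-in pixel block
          coeffs = dctn(block, norm="ortho")                # frequency domain, as in JPEG
          restored = idctn(coeffs, norm="ortho")            # back to pixels
          assert np.allclose(block, restored)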

      Academic evaluation usually does fall under a fair-use exception, but the instant someone sells or uses IP in some form of trade/promotion, it becomes a copyright violation.

      > 4. I use that frequency table

      See above; the "how it is made" argument is 100% BS. The statistical nature of LLMs simply can't prevent plagiarism and copyright violations. This was cited in the original topic links.

      > 5. I profit from this text generator.

      Since this content may inject liabilities into commercial settings, only naive fools will use it in a commercial context. Most "AI" companies lose around $4.50 per new customer, and are an economic fiction driven by some very silly people.

      LLM businesses are simply an unsustainable exploit. Unfortunately, they have also proved that wealthy entities can evade laws through regulatory capture and by settling the legal problems they couldn't avoid.

      I didn't make the rules, but I do disagree that cleverness supersedes a just rule of law. Have a wonderful day =3