Comment by phplovesong
2 days ago
We need a new license that forbids all training. That is the only way to stop big corporations from doing this.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.
It depends on the license terms, if you have a license that allowed you to get it legally where you agreed to those terms it would not be legal for that purpose.
But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.
It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.
4 replies →
I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.
Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?
Fair use was for citing and so on, not for ripping off 100% of the content.
Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.
This principle is also explicitly declared in US law:
> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)
https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...
2 replies →
Fair use doesn’t need a license, so it doesn’t matter what you put in the license.
Generally speaking licenses give rights (they literally grant license). They can’t take rights away, only the legislature can do that.
Exclusive or co-exclusive licences can nullify your default fair use in certain jurisdictions.
By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.
Isn’t it the very reason why we need cleanroom software engineering:
https://en.wikipedia.org/wiki/Cleanroom_software_engineering
If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code, and produce something similar is a grey area.
Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.
> And a human artist doesn't need to steal million pictures to learn to draw.
They... do? Not just pictures, but also real-life data, which is a lot more data than an average modern ML system has. An average artist has probably seen (or "stolen") millions of pictures from their social media feeds over their lifetime.
Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.
1 reply →
There is absolutely no reason that LLMs (or Corporations) should have the same rights as humans
So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.
Wouldn't it be still legal to train on the data due to fair use?
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
Honest question: why don’t you think it is fair use?
I can see how it pushes the boundary, but I can't lay out logic that it's not. The code has been published for the public to see. I'm always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
7 replies →
Just corporations, their shills, and people who think llms are god's gift to humanity disagree with you.
Not if it's an EULA and you make the bot click through an "I agree" button.
Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.
Taken to an extreme:
"Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."
It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.
> cartel continues to profit
It doesn't follow. The reverse is more likely: If you end prohibition, you end the mafia.
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights
In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.
Not sure why the FSF or any other organization hasn't released a license like this years ago already.
6 replies →
It isn't that difficult: a license that forbids how the program is used is a non-free software license.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
Yet the GPL imposes requirements on me, and we consider it free software.
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source and open weights.
Running the program and analyzing the source code are two different things...?
1 reply →
But training an AI on a text is not running it.
3 replies →
Model weights, source, and output.
How is that enforceable against the fly-by-night startups?
We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.
That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated or it's compiled. Anyway there's never that much stuff that's so special that it needs big legal stuff to prevent it from being copied, and if the LLM produces it you can just use another LLM to copy the same feature. And say it's 99% LLM and 1% human, who's going to know what the 1% is that's not safe to copy?
It's more or less already the case though. Pure AI-generated works without human touches are not copyrightable.
We need it to be infecting the rest like GPL does.
11 replies →
But then we would need a way to prove that some code was LLM generated, right?
Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?
you could have the inverse - proof that the code was _not_ LLM generated. It's like a mark of origin/country of origin for produce.
2 replies →
Laws exist to protect those who make and have money. If trillions could be made harvesting your kids' kidneys, it would be legal.
It's done extrajudicially in warzones such as Palestine where hostages are returned from Israeli jails, with missing organs, dead or alive [0].
[0] https://factually.co/fact-checks/justice/evidence-investigat...
1 reply →
So an EULA?