Comment by phplovesong
2 days ago
We need a new license that forbids all training. That is the only way to stop big corporations from doing this.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.
It depends on the license terms, if you have a license that allowed you to get it legally where you agreed to those terms it would not be legal for that purpose.
But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.
It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.
4 replies →
I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.
Isn't the court fight on fair use failing pretty hard on the prong that flooding the market with cheap copies eliminates the market for the original work?
Fair use was for citing and so on, not for ripping off 100% of the content.
Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.
This principle is also explicitly declared in US law:
> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)
https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...
2 replies →
Fair use doesn’t need a license, so it doesn’t matter what you put in the license.
Generally speaking licenses give rights (they literally grant license). They can’t take rights away, only the legislature can do that.
Exclusive or co-exclusive licences can nullify your default fair use in certain jurisdictions.
By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.
Isn’t it the very reason why we need cleanroom software engineering:
https://en.wikipedia.org/wiki/Cleanroom_software_engineering
If a human reads code, and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code, and produce something similar is a grey area.
Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.
> And a human artist doesn't need to steal million pictures to learn to draw.
They... do? Not just pictures, but also real-life data, which is a lot more data than an average modern ML system has. An average artist has probably seen (or "stolen") millions of pictures from their social media feeds over their lifetime.
Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.
1 reply →
There is absolutely no reason that LLMs (or Corporations) should have the same rights as humans
So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.
Wouldn't it be still legal to train on the data due to fair use?
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
Honest question: why don’t you think it is fair use?
I can see how it pushes the boundary, but I can't lay out logic that it's not. The code has been published for the public to see. I'm always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
7 replies →
Just corporations, their shills, and people who think llms are god's gift to humanity disagree with you.
Not if it's an EULA and you make the bot click through an "I agree" button.
Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.
Taken to an extreme:
"Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."
It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.
> cartel continues to profit
It doesn't follow. The reverse is more likely: If you end prohibition, you end the mafia.
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights
In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.
Not sure why the FSF or any other organization hasn't released a license like this years ago already.
6 replies →
It isn't that difficult: a license that forbids how the program is used is a non-free software license.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
Yet the GPL imposes requirements on me, and we consider it free software.
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source and open weights.
Running the program and analyzing the source code are two different things...?
1 reply →
But training an AI on a text is not running it.
3 replies →
Model weights, source, and output.
How is that enforceable against the fly-by-night startups?
We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.
That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated or it's compiled. Anyway there's never that much stuff that's so special that it needs big legal stuff to prevent it from being copied, and if the LLM produces it you can just use another LLM to copy the same feature. And say it's 99% LLM and 1% human, who's going to know what the 1% is that's not safe to copy?
It's more or less already the case though. Pure AI-generated works without human touches are not copyrightable.
We need it to be infecting the rest like GPL does.
11 replies →
But then we would need a way to prove that some code was LLM generated, right?
Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?
you could have the inverse - proof that the code was _not_ LLM generated. It's like a mark of origin/country of origin for produce.
2 replies →
Laws exist to protect those who make and have money. If trillions could be made harvesting your kids' kidneys, it would be legal.
It's done extrajudicially in warzones such as Palestine where hostages are returned from Israeli jails, with missing organs, dead or alive [0].
[0] https://factually.co/fact-checks/justice/evidence-investigat...
1 reply →
So an EULA?