(note: not a lawyer) It depends on whether a model is a derivative work of its source material or not. If it is, then all copyright protections come into force. If not, then the author can't rely on copyright to protect themselves.
My instinct/gut says that an AI model is a derivative work of the training data (in that it quite literally takes training data to produce a new creative output, with the "human addition" being the selection of training data to use), but there aren't really any clear judgments on it either way for the time being, which leaves room to argue.
The actual methodology used ("isn't an LLM like a computer reading a book for yourself?") is an irrelevant distraction in this regard. Computers aren't people and don't get that sort of protection; they're ultimately tools programmed to do things by humans and as a rule we hold humans responsible when those tools do something bad/wrong. "Computer says no" works on the small scale, but in cases like this, it's not really an adequate defense.
Or rather, that is how it should be; I think the uncomfortable truth here is that we need Congress to make laws to clarify the situation in favor of society, and Congress does not seem willing to do that.
Doesn't synthetic data complicate this reasoning? If I train a model on synthetic data, which is not protected by copyright, I am free to do as I please. It won't even regurgitate the originals; it will learn the abstractions rather than memorizing the exact expression, because it never sees it.
But it's not just supervised training. Maybe a model trained on reasoning traces and RLHF is not a mere derivative of the training set. All recent models are being trained on self-generated data produced with reward or preference models.
When a model trains on a piece of text, it gets little or no gradient from the parts it already predicts well; it mostly absorbs the novel parts. So what it takes from each example depends on the ordering of training examples. It is a process of diffing between model and text, which could be seen as a form of analysis rather than simple memorization.
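To make the "diffing" intuition concrete, here's a toy sketch (plain Python, made-up numbers): the gradient of softmax cross-entropy with respect to the logits is softmax(logits) minus the one-hot target, so a token the model already predicts confidently produces an almost-zero update, while a surprising token produces a large one.

  import math

  def grad_norm(logits, target):
      # d(cross_entropy)/d(logits) = softmax(logits) - one_hot(target)
      exps = [math.exp(x) for x in logits]
      probs = [e / sum(exps) for e in exps]
      grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
      return sum(abs(g) for g in grad)

  # Token the model already "knows" (correct class gets a high logit):
  print(grad_norm([-5, -5, 5, -5, -5], target=2))  # ~0.0004, almost no update

  # Token that is novel/surprising to the model (uniform logits):
  print(grad_norm([0, 0, 0, 0, 0], target=2))      # 1.6, a real learning signal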
Even if it is infringement to train on protected works, the model is 100x to 1000x smaller than the training set; it has no space to memorize it.
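As a rough back-of-the-envelope check of that ratio (purely illustrative numbers, not tied to any specific model):

  # Hypothetical figures: a 70B-parameter model trained on ~15T tokens.
  params = 70e9
  bytes_per_param = 2              # fp16/bf16 weights -> ~140 GB of weights
  model_bytes = params * bytes_per_param

  tokens = 15e12
  bytes_per_token = 4              # rough average for tokenized text
  data_bytes = tokens * bytes_per_token           # ~60 TB of training text

  print(data_bytes / model_bytes)  # ~430x: the weights are far smaller than the corpus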
The larger the training set, the less impact any one work has. It is de minimis use; paradoxically, the more you take, the less you imitate.
That should matter when estimating damages.
Understood. Was there any conclusion to the past copyright cases that have been filed against OpenAI / Anthropic?
All still pending as far as I'm aware. The only concluded lawsuit held that LAION isn't responsible for how AI companies use its dataset and that merely providing a tagged image index isn't in and of itself copyright infringement (and that lawsuit was decided in Germany, not the US).
That's what I do all the time, when I buy a copy of a book and read it.
Since there is no such thing as training rights, they would have a reasonable claim.
I think it is more reasonable for content owners to say what can and cannot be done with their data. After all, content is what makes AI possible, and content owners could easily start their own LLM if they wanted to, since a lot of the tooling is open source now.
You're taking an "everything not permitted is forbidden" approach, which contradicts the common law principle of residual freedom.
This would automatically outlaw any new use of information (eg music sampling) by default.
If all novel uses were banned from the outset, cultural progress would suffer immeasurably.
They are not content "owners" though. They have a copyright that regulates who can copy and distribute that data. They don't have a say in how that content is used when acquired legally, as long as your activity doesn't constitute distribution.
That is not reasonable. Should a child heed the restrictions placed on their 1st grade math book later in life, when they become a PhD?
>I think it is more reasonable for content owners to say what can and cannot be done with their data.
They lose that right as soon as they sell it to other people.
No, you can't sell a book to someone and then sue anyone who reads the book upside down.
That would be ridiculous. If you don't want someone reading your book upside down, or training on it, then don't sell books.
That is still an open question.