
Comment by throw310822

9 months ago

In fairness, I don't think Meta would have (had) any trouble paying the fair price of every book they downloaded (the price of exactly one copy) if that had been possible to do at scale.

Paying the price of one copy does not imply that you can use it for training, right?

  • (note: not a lawyer) It depends on whether a model is a derivative work of its source material or not. If yes, then all copyright protections come into force. If not, then the author can't rely on copyright to protect themselves.

    My instinct/gut says that an AI model is a derivative work of the training data (it quite literally takes training data to produce a new creative output, the "human addition" being the selection of which training data to use), but there are no clear judgments on it either way for the time being, which leaves room to argue.

    The actual methodology used ("isn't an LLM like a computer reading a book for yourself?") is an irrelevant distraction in this regard. Computers aren't people and don't get that sort of protection; they're ultimately tools programmed to do things by humans and as a rule we hold humans responsible when those tools do something bad/wrong. "Computer says no" works on the small scale, but in cases like this, it's not really an adequate defense.

    • Or rather, that is how it should be; I think the uncomfortable truth here is that we need Congress to make laws clarifying the situation in favor of society, and Congress does not seem willing to do that.

    • Doesn't synthetic data complicate this reasoning? If I train a model on synthetic data, which is not protected by copyright, I am free to do as I please. It won't even regurgitate the originals: it will learn the abstractions rather than memorizing the exact expression, because it never sees it.

      But it's not just supervised training. Maybe a model trained on reasoning traces and RLHF is not a mere derivative of the training set; all recent models are trained on self-generated data produced with reward or preference models.

      When a model trains on a piece of text, it derives almost no gradient from the parts it already predicts well; it mostly absorbs the novel parts (first sketch below). So what it takes from each example depends on the ordering of training examples. It is a process of diffing between model and text, which could be seen as a form of analysis rather than simple memorization.

      Even if it is infringement to train on protected works, the model is 100x to 1000x smaller than its training set (second sketch below); it has no space to memorize it.

      The larger the training set, the less impact any one work has. It is de minimis use; paradoxically, the more you take, the less you imitate.

      That should matter when estimating damages.
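      A minimal sketch of the gradient point above (illustrative Python, not anyone's actual training code): for softmax cross-entropy, the gradient with respect to the logits is simply predicted_probs - one_hot_target, so a token the model already predicts confidently contributes a near-zero update.

        import numpy as np

        def logit_gradient(logits, target):
            """Gradient of cross-entropy loss w.r.t. the logits for one token."""
            probs = np.exp(logits - logits.max())   # numerically stable softmax
            probs /= probs.sum()
            one_hot = np.zeros_like(probs)
            one_hot[target] = 1.0
            return probs - one_hot                  # the classic softmax-CE gradient

        # Token the model already "knows": high logit on the correct class.
        known = np.array([8.0, 0.0, 0.0, 0.0, 0.0])
        # Token the model finds novel: uniform logits, no idea what comes next.
        novel = np.zeros(5)

        print(np.abs(logit_gradient(known, 0)).sum())  # ~0.003 -> near-zero update
        print(np.abs(logit_gradient(novel, 0)).sum())  # 1.6    -> large update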
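      And a back-of-the-envelope check of the size ratio, using round illustrative numbers (a hypothetical 70B-parameter model and ~15T-token corpus, not any specific model's real specs):

        params = 70e9          # assume a 70B-parameter model
        model_gb = params * 2 / 1e9            # fp16 weights -> ~140 GB

        train_tokens = 15e12   # assume a ~15T-token pretraining corpus
        corpus_tb = train_tokens * 4 / 1e12    # ~4 bytes of text per token -> ~60 TB

        print(corpus_tb * 1e3 / model_gb)      # ~430x more text than weights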

  • Since there is no such thing as training rights, they would have a reasonable claim.

    • I think it is more reasonable for content owners to say what can and cannot be done with their data. After all, content is what makes AI possible, and content owners could easily start their own LLM if they wanted to, since a lot of it is open source now.


They would have a better defense if they had escrowed that money and/or made a reasonable attempt to buy the books.

  • Indeed. Although there is precedent for rights owners being invited to come forward and claim their due when it wasn't possible to contact them beforehand. You probably need proof that you made the effort, though.

    It's also true that anyone can go to a public library and read all its contents for free; the point is that they can't further distribute them except in a highly processed form (i.e. they can distribute original works influenced by what they have read). Here the issue is the scale of both the "reading" part and the "producing original work" part.