Comment by visarga
9 months ago
Doesn't synthetic data complicate this reasoning? If I train a model on synthetic data, which is not protected by copyright, I am free to do as I please. It won't even regurgitate the originals: it learns the abstractions rather than memorizing the exact expression, because it never sees that expression.
But it's not just supervised training. Maybe a model trained on reasoning traces and RLHF is not a mere derivative of the training set. All recent models are trained on self-generated data produced with reward or preference models.
When a model trains on a piece of text, it derives almost no gradient from the parts it already predicts well; it only absorbs the novel parts. So what it takes from each example depends on the ordering of training examples. It is a process of diffing between model and text, which could be seen as a form of analysis rather than simple memorization. The sketch below makes this concrete.
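A minimal sketch of that diffing intuition, using standard PyTorch and a toy vocabulary (all numbers illustrative, not from any real model): the gradient of the cross-entropy loss with respect to the logits is softmax(logits) minus the one-hot target, so a token the model already predicts with high probability contributes almost nothing to the weight update.

```python
import torch
import torch.nn.functional as F

# d(cross_entropy)/d(logits) = softmax(logits) - one_hot(target).
# The better the model already predicts a token, the smaller the update it triggers.

vocab_size = 10
target = torch.tensor([3])  # index of the "true" next token

for p_true in (0.10, 0.50, 0.99):
    # Build logits whose softmax puts probability p_true on the target token
    # and spreads the remainder uniformly over the other tokens.
    rest = (1.0 - p_true) / (vocab_size - 1)
    probs = torch.full((1, vocab_size), rest)
    probs[0, target] = p_true
    logits = probs.log().clone().requires_grad_(True)

    loss = F.cross_entropy(logits, target)
    loss.backward()
    print(f"p(true token)={p_true:.2f}  loss={loss.item():.4f}  "
          f"grad norm={logits.grad.norm().item():.4f}")
```

As p(true) rises from 0.10 to 0.99, both the loss and the gradient norm collapse toward zero: text the model already knows is barely absorbed, which is why ordering matters.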
Even if training on protected works is infringement, the model is 100x to 1000x smaller than its training set; it has no room to memorize it.
The larger the training set, the less impact any one work has. It is de minimis use: paradoxically, the more you take, the less you imitate. Rough numbers below.
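A back-of-the-envelope on capacity and per-work share, with loudly hypothetical numbers (a 70B-parameter model, a 10T-token corpus, ~500 KB per book), just to show the orders of magnitude:

```python
# All numbers hypothetical, chosen only to illustrate orders of magnitude.
params = 70e9            # model parameters
bytes_per_param = 2      # fp16/bf16 storage
model_bytes = params * bytes_per_param

train_tokens = 10e12     # tokens seen during training
bytes_per_token = 4      # rough average for BPE-tokenized text
corpus_bytes = train_tokens * bytes_per_token

book_bytes = 500e3       # one average-length book, ~500 KB of text

print(f"model size:   {model_bytes / 1e9:.0f} GB")      # 140 GB
print(f"corpus size:  {corpus_bytes / 1e12:.0f} TB")    # 40 TB
print(f"corpus/model: {corpus_bytes / model_bytes:.0f}x")  # ~286x
print(f"one book is {book_bytes / corpus_bytes:.1e} of the corpus")  # ~1e-08
```

Under these assumptions the corpus outweighs the model by a few hundred times, and any single book is a vanishing fraction of what the model saw.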
That should matter when estimating damages.