Comment by Nevermark

18 hours ago

> These things are very well and precisely defined in just about every jurisdiction.

Yes, we have lots of wording that attempts to be precise. And legal terms, pinned down by definition and precedent, are certainly more precise than everyday language.

But ambiguity about the facts is only half of it. Even when all the facts appear clear, juries have to use subjective human judgement to match what the law says, which may be clear in theory but is often fuzzy at the borders, against those facts. Reasonable people often differ on how to make that match in borderline cases.

We resolve both kinds of ambiguity case by case by having a jury decide. That won't be consistent from jury to jury, but it is the best system we have. Attorneys vetting prospective jurors are very much aware that the law comes down to humans interpreting human language and concepts, none of which are truly precise unless we are talking about objective measures (like frequency-band use).

---

> it is the question of what constitutes a derivative

Yes, the legal side can adapt.

And the technical side can adapt too.

The problem isn't that copyrighted material was trained on, but that the resulting model facilitates reproducing individual works (or close variations) and repurposing individuals' unique styles.

I.e. they violate fair use by using what they learn in a way that devalues others' creative efforts. Being exposed to copyrighted works available to the public is not the violation. (Though the way training currently happens does produce models that violate fair use.)

We need models that, one way or another, stay within fair use once trained: either by not training on copyrighted material at all, or by training on it in a way that doesn't produce models that facilitate specific reproduction and repurposing of creative works and styles.

This has already been solved for simple data problems, where memorization of particular samples can be precluded by adding noise to a dataset. Important generalities are learned, but specific samples don't leave their mark.
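
To make the noise idea concrete, here's a minimal sketch of the classic input-perturbation mechanism from differential privacy, in Python with NumPy. The `sensitivity` and `epsilon` names and values are illustrative assumptions on my part; real systems calibrate them carefully, or move the noise into training itself (as in DP-SGD).

```python
import numpy as np

def privatize(dataset: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    # Laplace input perturbation: add noise with scale sensitivity/epsilon
    # to every record, so no single sample survives exactly.
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=dataset.shape)
    return dataset + noise

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=(10_000, 3))  # the "general" signal: mean ~5
noisy = privatize(data, sensitivity=1.0, epsilon=0.5)

print(data.mean(axis=0))           # ~[5, 5, 5]
print(noisy.mean(axis=0))          # still ~[5, 5, 5]: aggregates survive the noise
print(np.abs(noisy[0] - data[0]))  # individual records are heavily perturbed
```

Aggregate statistics (and by extension the broad generalities a model learns) survive the noise, while any individual record is unrecoverable. That's the property we'd want an analogue of for creative works.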

Obviously something more sophisticated would need to be done to preclude memorization of rich creative works and styles, but a lot of people are motivated to solve this problem.