Comment by omeid2

7 months ago

> There are very few laws that are not giant ambiguities. Where is the line between murder, self-defense and accident? There are no lines in reality.

These things are very well and precisely defined in just about every jurisdiction. The "ambiguities" arise from ascertaining the facts of the matter, and from whether a given set of facts fits within a specific set of rules.

> Something must change regarding copyright and AI model training.

Yes, but this problem is not specific to AI; it is the question of what constitutes a derivative work, and that is a rather subjective matter in light of the good ol' axiom that "nothing is new under the sun".

> These things are very well and precisely defined in just about every jurisdiction.

Yes, we have lots of wording attempting to be precise. And legal terms are certainly made more precise, through definition and precedent, than ordinary language.

But ambiguities about facts are only half of it. Even when all the facts appear to be clear, human juries have to use their subjective judgement to match what the law says, which may be clear in theory but is often fuzzy at the borders, against those facts. And reasonable people often differ on how they match the two up in borderline cases.

We resolve both types of ambiguity case by case by having a jury decide, which is not going to be consistent from jury to jury, but it is the best system we have. Attorneys vetting prospective jurors are very much aware that the law comes down to humans interpreting human language and human concepts, none of which are truly precise unless we are talking about objective measures (like frequency band use).

---

> it is the question of what constitutes a derivative

Yes, the legal side can adapt.

And the technical side can adapt too.

The problem isn't that material was trained on, but that the resulting model facilitates reproducing individual works (or close variations of them) and repurposing individuals' unique styles.

I.e. they violate fair use by using what they learn in a way that devalues others' creative efforts. Being exposed to copyrighted works available to the public is not the violation. (Even though the way training currently happens does produce models that violate fair use.)

We need models that, one way or another, stay within fair use once trained: either by not training on copyrighted material, or by training on copyrighted material in a way that doesn't create models which facilitate specific reproduction and repurposing of creative works and styles.

This has already been solved for simple data problems, where memorization of particular samples can be precluded by adding noise to a dataset. Important generalities are learned, but specific samples don't leave their mark.
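
(Presumably this refers to differential-privacy-style training, e.g. DP-SGD: clip each example's gradient contribution, then add calibrated noise, so no single sample can leave a distinctive mark on the model. A rough sketch in Python with numpy follows; the toy model, clip norm, and noise scale are all illustrative choices, not a real recipe.)

```python
import numpy as np

# Sketch of DP-SGD (Abadi et al. 2016): clip each example's gradient
# to a fixed norm, then add Gaussian noise before the update, so no
# single training sample leaves a distinctive mark on the model.
# Toy model: linear regression. All hyperparameters are illustrative.

rng = np.random.default_rng(0)
n, d = 256, 4
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

w = np.zeros(d)
clip_norm = 1.0    # cap on each example's gradient L2 norm
noise_std = 0.5    # noise scale, calibrated relative to clip_norm
lr = 0.1

for step in range(500):
    batch = rng.choice(n, size=32, replace=False)
    # Per-example gradients of the squared error: 2 * (x.w - y) * x
    residuals = X[batch] @ w - y[batch]
    grads = 2.0 * residuals[:, None] * X[batch]
    # Clip so no single example can dominate the update.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # Noise calibrated to the clipping bound masks any one sample.
    noisy_sum = grads.sum(axis=0) + rng.normal(scale=noise_std * clip_norm, size=d)
    w -= lr * noisy_sum / len(batch)

print("recovered weights:", np.round(w, 2))  # close to true_w
```

The generality (the weight vector) survives the clipping and noise; the influence of any individual sample is bounded and drowned out.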

Obviously something more sophisticated would need to be done to preclude memorization of rich creative works and styles, but a lot of people are motivated to solve this problem.

  • It seems like your concern is about how easy it is going to be to create derivative and similar work, rather than a genuine concern for copyright. Do I understand correctly?

    • No, I am just narrowing down the problem definition to the actual damage.

      Which is an approach that respects both fair use and copyright.

      Taking/obtaining value from works is ok, up until the point where it damages the value of the original works. That is not ok, because copyright protects that value to incentivize the creation and sharing of works.

      The problem is that models are shipping that inherently make it easy to reproduce copyrighted works, and to apply specific styles lifted from a single author's copyrighted body of work.

      I am very strongly against this.

      Note that prohibiting copying of a recognizable single author's specific style is even stricter than the fair use limits placed on humans. Stricter makes sense to me because, unlike humans, models are mass producers.

      So I am extremely respectful of protecting copyright value.

      But it is not the same thing as not training on something. It is worth exploring training algorithms that can learn useful generalities about bodies of work without retaining biases toward the specifics of any one work or any single author's style. That would be in the spirit of fair use: you can learn from any art if it's publicly displayed or you have paid for a copy, but you can't create mass copiers of it.

      Maybe that is impossible, but I doubt it. There are many ways to train that steer important properties of the resulting models.

      Models that make it trivial to create new art deco works, consistent with the total body of art deco works: ok. Models that make it trivial to recreate Erte works, or to produce new works in a specifically Erte style: not ok.
