Comment by duskwuff

16 hours ago

Same. If I invented a novel way of encoding video and used it to pack a bunch of movies into a single file, I would fully expect to be sued if I tried distributing that file, and equally so if I ran a web site that let people extract individual videos from it. Why should text be treated differently?

You are allowed to quote from copyrighted works without needing permission. Trying to assert copyright over a quote of, say, a mere 60 words would get you thrown out of any court.

It was shown, in this case, that the LLMs wouldn't generate accurate quotes longer than 60 words.

This is not comparable to encoding a full video file.

I think the better analogy is this: someone with a superhuman but imperfect memory reads a bunch of material, and you're then allowed to talk to that person about what they've read. Does that violate copyright? I'd say clearly no.

Then what if their memory is so good that they repeat entire sections verbatim when asked? Does that violate it? I'd say it's grey.

But that's a very specific case. Reproducing large chunks of copyrighted work is something that can be quite easily detected and prevented, and I'm almost certain the frontier labs are already doing this.
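
To illustrate what I mean (purely my own sketch, not something any lab has confirmed they run), an output filter could index overlapping word n-grams from protected texts and flag any generation that reproduces a long enough verbatim run. The function names and the 8-word / 10-word thresholds below are made up for the example:

```python
# Hypothetical verbatim-reproduction filter (illustrative sketch only).
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(protected_texts, n=8):
    """Set of every n-word sequence appearing in any protected text."""
    index = set()
    for text in protected_texts:
        index.update(ngrams(text.lower().split(), n))
    return index

def longest_verbatim_run(output, index, n=8):
    """Length, in words, of the longest span of output whose n-grams all appear in the index."""
    words = output.lower().split()
    best = run = 0
    for gram in ngrams(words, n):
        if gram in index:
            run = run + 1 if run else n  # first hit covers n words; each further consecutive hit adds one
            best = max(best, run)
        else:
            run = 0
    return best

protected = ["call me ishmael some years ago never mind how long precisely having little or no money"]
index = build_index(protected)
candidate = "call me ishmael some years ago never mind how long precisely"
if longest_verbatim_run(candidate, index) >= 10:
    print("output reproduces a long verbatim passage; block or paraphrase it")
```

In practice you'd want hashed n-grams and a real tokenizer rather than split(), but the point is that long verbatim overlap is cheap to detect at generation time.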

So I think it's just not very clear. The reality is that this is a novel situation, and the job of the courts is now to basically decide what's allowed and what's not. But the rationale shouldn't be "this can't be fair use, it's just compression", because it's clearly something fundamentally different and existing laws just aren't applicable, imo.

  • That's not a great analogy. A person is expected to use their discretion, and can be held legally liable for their actions. A machine is not, and cannot.

    > Then what if their memory is so good that they repeat entire sections verbatim when asked? Does that violate it? I'd say it's grey.

    That's an unambiguous "yes". Performing a copyrighted play or piece of music without the rights to do so is universally considered a copyright violation, even if the performers are performing from memory. It's still a copyright violation if they don't remember their parts perfectly and have to ad-lib sometimes, or if they don't perform the entire work from start to finish.

This is a strawman, in the sense that it is not accurate to think of AI models as a compressed form of their training data, since the lossiness is so high. One of the insights from the trial is that LLMs are particularly poor at reproducing original texts (60 tokens was the max found in this trial, IIRC). This is taken into account under the fourth fair use factor: the effect of the use on the market for the original work. It's hard to argue that LLMs are replacing long-form text works when they have so much trouble actually producing them.

There's a whole related topic here in the realm of news (since it's shorter form), but news also has a much shorter half-life. Not sure what I think there yet.