Comment by Incipient
12 hours ago
I'm not sure how LLMs count as fair use. Is it just that we can't show HOW the works have been encoded in the model, so it's fair use? Or that statistical representations are fair use? Or is it the generation aspect? I can't sell you a Harry Potter book, but I can sell you some service that lets you generate it yourself?
I feel like this has really blown a hole in copyright.
> Is it just that we can't show HOW the works have been encoded in the model, so it's fair use?
Describing training as “encoding them in the model” doesn’t seem like an accurate description of what is happening. We know for certain that a typical copyrighted work that is trained on is not contained within the model. It’s simply not possible to represent the entirety of the training set within a model of that size in any meaningful way. There are also papers showing that memorisation plateaus at a reasonably low rate according to the size of the model. Training on more works doesn’t result in more memorisation, it results in more generalisation. So arguments based on the idea that those works are being copied into the model don’t seem to be founded in fact.
> I can't sell you a Harry Potter book, but I can sell you some service that lets you generate it yourself?
That’s the reason why cases like this are doomed to fail: No model can output any of the Harry Potter books. Memorisation doesn’t happen at that scale. At best, they can output snippets. That’s clearly below the proportionality threshold for copyright to matter.
> That’s clearly below the proportionality threshold for copyright to matter.
This type of reasoning keeps coming up with seemingly zero consideration for why copyright actually exists. The goal of copyright, under US law, is "To promote the progress of science and useful arts".
The goal of companies creating these LLMs is to supersede the use of source material they draw from, like books. You use an LLM because it has all the answers without your having to spend money compensating the original authors or put in the work of digesting the material yourself; that's their entire value proposition.
Their end game is to create a product so good that nobody has a reason to ever buy a book again. A few hours after you publish your book, the LLM will gobble it up and distribute the insights contained within to all of their users for free. "It's fair use," they say. There won't be any economic incentive to write books at that point, and so "the progress of science and useful arts" will grind to a halt. Copyright defeated.
If LLM companies are allowed to produce market substitutes of original works, then the goal of copyright is being defeated on a technicality, and this ought to be a discussion about whether copyright should be abolished completely, not about whether big tech should be allowed to get away with it.
> The goal of companies creating these LLMs is to supersede the use of source material they draw from, like books.
Nobody is going to stop buying Harry Potter books because they can get an LLM to spit out ~50 words from one of the books. The proportionality factor is very clearly relevant here.
> If LLM companies are allowed to produce market substitutes of original works
Did Meta publish a book written by an LLM?
> The goal of copyright, under US law, is "To promote the progress of science and useful arts".
I would consider training LLMs to be very much in line with those goals.
Quoting Judge Alsup from his recent ruling in Bartz v. Anthropic.
> Instead, Authors contend generically that training LLMs will result in an explosion of works competing with their works — such as by creating alternative summaries of factual events, alternative examples of compelling writing about fictional events, and so on. This order assumes that is so (Opp. 22–23 (citing, e.g., Opp. Exh. 38)). But Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition.
Copyright was built to protect the artist from unauthorized copying by a human, not by a machine (a machine wildly beyond their imagination at the time, I mean), so the input and output limitations of humans were absolutely taken into account when writing such laws. If LLMs were treated in similar fashion, authors would have had a say in whether their works could be used as inputs to such models, or could forbid it.
This reply doesn’t seem to relate to either of the points I made.
Same. If I invented a novel way of encoding video and used it to pack a bunch of movies into a single file, I would fully expect to be sued if I tried distributing that file, and equally so if I let people use a web site that let them extract individual videos from that file. Why should text be treated differently?
You are allowed to quote from copyrighted works without needing permission. Trying to assert copyright because of a quote of, say, a mere 60 words in length would get you thrown out of any judge’s court.
It was shown, in this case, that the LLMs wouldn't generate accurate quotes longer than 60 words.
This is not comparable to encoding a full video file.
I think the better analogy is this: if you had someone with a superhuman but not perfect memory read a bunch of stuff, and you were then allowed to talk to that person about the things they'd read, would that violate copyright? I'd say clearly not.
Then what if their memory is so good that they repeat entire sections verbatim when asked? Does that violate it? I'd say that's grey.
But that's a very specific case - reproducing large chunks of owned work is something that can be quite easily detected and prevented, and I'm almost certain the frontier labs are already doing this.
So I think it's just not very clear - the reality is that this is a novel situation, and the job of the courts is now to basically decide what's allowed and what's not. But the rationale shouldn't be 'this can't be fair use, it's just compression', because it's clearly something fundamentally different and existing laws just aren't applicable imo.
This is a strawman, in the sense that it is not accurate to think about AI models as a compressed form of their training data, since the lossiness is so high. One of the insights from the trial is that LLMs are particularly poor at reproducing original texts (60 tokens was the max found in this trial, IIRC). This is taken into account when considering fair use based on the fourth fair use factor: how the work impacts the market for the original work. It's hard to make an argument that LLMs are replacing long-form text works, since they have so much trouble actually producing them.
There's a whole related topic here in the realm of news (since it's shorter form), but it also has a much shorter half-life. Not sure what I think there yet.
One should also keep in mind the countless people who got much of their education from pirated books.
That has nothing to do with whether LLMs are fair use.
This seems to be a bad faith argument, although it would be amusing to see Facebook use it.
"Your honor, it's fair use because students have downloaded educational books for years."
People can process only a fraction of that, while taking much more time to do it. And I'm using 'process' here to meet you in your nihilist argument that these algorithms are the same as humans. Which is pretty strange, because people barely acknowledge similarities with other mammals, but suddenly software is equal.
I'm inclined to agree here. LLMs do not use just a paragraph here and there in accordance with fair use, but rather use the entire body of work to train themselves.
Or am I misunderstanding something about LLMs?
The judge is claiming that because the use of the books is “so transformative,” the usage of these books to train an LLM is fair use.
I’m not familiar with the facts of the case and IANAL, and it’s late, but how did the plaintiffs determine their books were being used for training of the LLM? Was the model spitting out language that was similar to, or verbatim from, their works?
> The judge is claiming that because the use of the books is “so transformative,” the usage of these books to train an LLM is fair use.
Maybe I'm mistaken, but shouldn't the material come from a legal source? This is not public domain material.
Again, if I download the entire catalogue of HBO TV shows, then make a "transformative" version on my iPhone, how can that be considered fair use?
> Maybe I'm mistaken, but shouldn't the material come from a legal source
There is no such thing as a legal or illegal source, only legal or illegal uses.
If the use was legal, then it doesn't matter where you got the material from. Similarly, if you got the material via more conventional means, it would still be copyright infringement if you used it in an illegal way.
> Again, if I download the entire catalogue of HBO TV shows, then make a "transformative" version on my iPhone, how can that be considered fair use?
That wouldn't be considered transformative. In this context "transformative" means you transformed it into something with a different purpose than the original.
However, if you for example made a video essay for YouTube talking about the themes (or whatever) of the TV show, including clips from it, that would be transformative and probably fine.
It'll be, but in a slightly different way: it will be considered _fair_ for Warner Bros to sue you dry.
> but how did the plaintiffs determine their books were being used for training of the LLM?
I think Facebook admitted this. I don't think this fact is under dispute.
> The judge is claiming that because the use of the books is “so transformative,” the usage of these books to train an LLM is fair use.
"You're doing something so critical to our (country's) success that we're OK with waiving copyright." I get that; if the US doesn't do it, then China will (or already is).
Interesting judgement, and its implications, if you are correct, haha.
"Transformative" is a legal term with a specific meaning. The copyrighted work has to be transformed somehow rather than copied.
https://en.m.wikipedia.org/wiki/Transformative_use
The word "transformative" was written into the law in a time of manual transformative processes, like when you paint something similar to what you saw in a painting by another artist, with all the implied limitations that entails: the time it takes you to study that painting, and the time it takes you to create the new one. That has nothing at all to do with the way LLMs operate. An honest assessment would have found that the word was meant for a wildly different use case and therefore required a bigger and more nuanced discussion.
> The word "transformative" was written into the law in a time of manual transformative processes, like when you paint something similar to what you saw in a painting by another artist
Do you have any citation that that is how the word "transformative" was understood historically? Because what you're suggesting seems to be the opposite of what I've read.
My understanding is that even back in the 1800s (e.g. https://en.wikipedia.org/wiki/Folsom_v._Marsh ) your example would not be considered transformative, if your intention was to make a similar painting to serve a similar purpose.