Comment by NobodyNada

7 months ago

One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)

Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?

[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

A judge already ruled that models themselves don't constitute copyright infringement in Kadrey v. Meta Platforms, Inc. (https://casetext.com/case/kadrey-v-meta-platforms-inc). The EFF has a good summary about it:

> the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.

See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...

  • Time to overfit on some books and publicize them as a libgen mirror.

    • I think this could lead to interesting results outside the legalities.

      Imagine you're getting it to spit out lord of the rings, but midway through you inject into the output 'Suddenly, the ring split in two. No longer one ring to rule them all, but two!'.

      You then let the model write the rest of the story!

      3 replies →

Yes and no.

In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.

So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.

But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.
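For concreteness, here is a minimal sketch of that kind of prefix-continuation probe, assuming a Hugging Face causal LM with greedy decoding; the model name and the 50-token window are illustrative placeholders, not the study's exact setup:

```python
# Sketch of a prefix/continuation memorization probe (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder model name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def continuation_matches(text: str, window: int = 50) -> bool:
    """Feed the first `window` tokens of a passage and check whether
    greedy decoding reproduces the next `window` ground-truth tokens."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if len(ids) < 2 * window:
        return False
    prefix, target = ids[:window], ids[window:2 * window]
    out = model.generate(prefix.unsqueeze(0), max_new_tokens=window, do_sample=False)
    return torch.equal(out[0, window:], target)  # exact match of the continuation
```

Passing a check like this on scattered 50-token windows is very different from free-running generation staying on the original text for pages at a time, which is the distinction drawn above.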

On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.

Also note:

- Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.

- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).

[1] https://arxiv.org/abs/2412.06370

  • Wouldn't a model that can recite training data verbatim be larger than necessary? Exact text isn't coming from nowhere, no matter how efficiently the bits are encoded, and the same effectiveness should be achievable by compressing those portions of the model.

    • Maybe we are all just LLMs. If the books were written by a language producing algorithm in a human mind, maybe there’s not as much raw data there as it seems, and the total information can in fact be stored in a surprisingly small set of weights.

      1 reply →

Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.

Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without using the LLM; assume you can extract the contents directly out of the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object, whereas closed models can be held accountable not by what they store but by what they produce.

  • This technology is a really bad way of storing, reproducing and transmitting the books themselves. It's probabilistic and lossy. It may be possible to reproduce some paragraphs, but no reasonable person would expect to read The Da Vinci Code by prompting the LLM. Surely the marketed use cases and the observed real use by users have to make it clear that the intended and vastly overwhelming use of an LLM is transformative, "digestive" synthesis of many sources to construct a merged, abstracted, generalized system that can function in novel uses, answering never-before-seen prompts in a useful manner, overwhelmingly without reproducing existing written works. It surely matters what the purpose of the thing is, both in intention and in observed practice. It's not a viable competing alternative to reading the actual book.

    • Not The DaVinci Code, but I recently tried reading "OCaml Programming: Correct + Efficient + Beautiful" through Gemini. The book is open, so I rightly assumed it was "in there". I read by saying "Give me the first paragraph of Chapter 6" and then something like "Next 3 paragraphs". If I had a question, I was able to ask it and get some more info and have something like a dialog.

      As far as I could tell, the book didn't match what's posted online today. The text was somewhat consistent on a topic, yet poorly written and made references to sections that I don't think existed. No amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it seemed consistent with the topic of the chapter.

      I tried to ascertain when the book had been scraped, yet couldn't find a match in Archive.org or in the book's git repo.

      Eventually I gave up and just continued reading the PDF.

    • The number of people who buy Cliffs Notes versions of books to pass examinations where they claim to have read the actual book suggests you are way overestimating how "reasonable" many people are.

      4 replies →

  • > broadly capable open models are on track for annihilation

    I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).

    I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.

    • In theory, couldn't you distill a non-infringing model from an infringing one? Just prompt it for continuations and give it a whack every time the output matches something in your dataset of copyrighted works.

      You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
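      A rough sketch of what that filtering step could look like (the 8-word shingle size and the in-memory corpus are assumptions for illustration, not anything from the case): drop any teacher output that shares a long exact word n-gram with a work in the comparison set.

      ```python
      # Sketch: reject distillation samples that overlap a reference corpus.
      # The 8-word shingle size and the corpus are illustrative assumptions.
      def shingles(text: str, n: int = 8) -> set:
          words = text.lower().split()
          return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

      def build_index(copyrighted_works: list, n: int = 8) -> set:
          index = set()
          for work in copyrighted_works:
              index |= shingles(work, n)
          return index

      def is_clean(sample: str, index: set, n: int = 8) -> bool:
          # True if the teacher output shares no n-word shingle with the corpus.
          return not (shingles(sample, n) & index)

      # Usage: keep only non-overlapping outputs for the student's training set.
      # student_data = [s for s in teacher_outputs if is_clean(s, build_index(corpus))]
      ```

      Whether such output-level filtering would actually be enough to make the distilled model non-infringing is, of course, exactly the legal question being discussed here.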

  • > a model file that contains enough of the source material to be considered infringing

    The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.

    • But there is the issue of whether there are damages. If my LLM can reproduce 10 random paragraphs of a Harry Potter book, it's hard to argue that anyone who would otherwise have bought the book skipped it because they could read those 10 paragraphs. So there will not be any damages to the publisher and the lawsuit will be tossed. There is a threshold for how much needs to be reproduced, and how closely, but it's a subjective standard and not some hard line like "more than 50%".

      1 reply →

  • > even without using the LLM, assume you can extract the contents directly out of the weights

    This is still a weird language shift that actively promotes misunderstandings.

    The weights are the LLM. When you say "model", that means the weights.

  • > extract the contents directly out of the weights

    If you can successfully demonstrate that, then yes, it is copyright infringement, and successfully doing so would be worthy of a NeurIPS or ACL paper.

  • Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.

    This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.

    The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.

No. You are free to memorize any copyrighted work. You are just not free to distribute it.

The model itself does not constitute a copy. Its intention is clearly not to reproduce verbatim texts. There would be far cheaper and infinitely more accurate ways to do that if that were the goal.

Apart from the legalities, it would be horrifying if copyright reached into the AI realm to completely stifle progress for, let's be honest, mainly the profits of a few major IP corporations.

I do however understand that some creatives are worried about revenue, just like the rest of us. But just like the rest of us, they too live in a world that can only exist because 99.99% of what it took to build that world was automated or tool-enhanced, impacting someone's previous employment or business.

We are in a world of unprecedented change, only to be immediately surpassed by the next day's rate of change. This both scares and fascinates me.

But that change and its benefits being held only in the bowels of corporate/government symbiotic entities would scare me a hell of a lot more. Open source/weights is the only way to have a small chance of keeping this at bay.

> One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works,

No, it doesn't. The order assumes that because it is an order on summary judgment, and the legal standard for such an order is that it must assume, on every material contested issue of fact, the position least favorable to the party for whom summary judgment is granted. Since it is a ruling for the defendant (Anthropic), it must be what the judge finds the law demands when assuming all contested issues of fact are resolved in favor of the claims of the plaintiffs (the authors).

> but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text.

No, it doesn't do that, either. It simply notes for clarity that the plaintiffs do not allege that an infringement is created by the outputs for the reason you describe; the ruling does not in any way suggest that this has any bearing on its findings as regards whether training the model infringes, it simply points out that that separate potential source of infringement is not at issue.

> Does this imply that distributing open-weights models such as Llama is copyright infringement

No, it does not. At most, given the reason that the plaintiffs have not made such an allegation in this case, it implies that the same plaintiffs might have alleged (without commenting at all as to whether they would prevail) that providing a hosted online service without filtering would constitute contributory infringement, if that were what Anthropic did (which it isn't) and if there were actual infringement committed by the users of the service.

Copyright was codified in an age where plagiarism was time-consuming. Even replacing words with synonyms on a mass scale was technically infeasible.

The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.

If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?

The original authors together spent a million times more effort (normalized for skill) and should therefore get a million times bigger reward than those who built the machine.

In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.

---

If "AI" is as transformative as they claim, they will have no trouble making so much money they they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.

  • > get fair compensation for the amount of work

    This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.

    The ultimate purpose is to foster the creation of new works so that the public can read them and written culture can thrive. The means of achieving this is ensuring that the authors of said works have a financial incentive to write.

    The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.

    Therefore, when some new factor such as LLMs enters the picture, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. That certainly has to take into account who will produce new written works and how, but that is an instrumental subgoal rather than the main target.

    • As you point out, people make rules ("laws") which benefit them. I care about fairness and justice though, even if I am a minority.

      Fundamentally, fair compensation is based on the amount of work put in (obviously taking skill/competence into account but the differences between people in most disciplines probably don't span a single order of magnitude, let alone several).

      The ultimate goal should be to prevent people who don't produce value from taking advantage of those who do. And among those who do, that they get compensated according to the amount of work and skill they put in.

      Imagine you spend a year building a house. I have a machine that can take your house and materialize a copy anywhere on earth for free. I charge people (something between 0 and the cost of building your house the normal way) to make them a copy of your house. I can make orders of magnitude more money this way than you. Are you happy about this situation? Does it make a difference how much i charge them?

      What if my machine only works if I scan every house on the planet? What if I literally take pictures of it from all sides, then wait for you to not be home and x-ray it to see what it looks like inside?

      You might say that you don't care because now you can also afford many more houses. But it does not make you richer. In fact, it makes you poorer.

      Money is not a store of value. If everyone has more money but most people only have 2x more and a small group has 1000x more, then the relative bargaining power has changed: the small group is better off and the large group is worse off. This is what undetectable, cheap mass plagiarism leads to for all intellectual work.

      ---

      I wrote a lot of open source code, some of it under permissive licenses, some GPL, some AGPL. The conditions of those licenses are that you credit me. Some of them also require that if you build on top of my work, you release your work under the same license.

      LLMs launder my code to make profit off of it without giving me anything (while other people make profit, thus making me poorer) and without crediting me.

      LLMs also take away the rights of the users of my code: the (A)GPL forces anyone who builds on top of my work to release the code when asked; with LLM-laundered code, this right no longer seems to exist, because who do you even ask?

      9 replies →

  • Copyright's goal, at least under the Constitution under which this court is ruling, is to "promote the progress of science and the useful arts", not to ensure that authors get paid for anything that strikes their whim.

    LLMs are models of languages, which are models of reality. If anyone deserves compensation, it's humanity as a whole, for example by nationalizing, or whatever the global equivalent is, LLMs.

    Approximately none of the value of LLMs, for any user, is in recreating the text written by an author. Authors have only ever been entitled to (limited) ownership of their expression; copyright has never given them ownership of facts.

Wouldn't the issue be letting third parties run the models without filters? No idea if this is right, but then the same would apply to Anthropic: they couldn't run the model without the filter system, which is a chicken-and-egg problem. You can't develop the filter without looking into the model.

I have yet to have anyone explain to me why LLM memorisation is worse than Google Images or a similar service caching thumbnails for faster image searches. Or caching blurbs of news stories for faster reproduction at search time.

You can use the copyrighted text for personal purposes.

  • But you can’t distribute it, which in the scenario mentioned in the parent’s final paragraph arguably happens.

    • You can't distribute the copyrighted works, but that isn't inherently the same thing as the model.

      It's sort of like distributing a compendium of book reviews. Many of the reviews have quotes from the book. If there are thousands of reviews, you could potentially reconstruct the whole book, but that's not the point of the thing and so it makes sense for the infringing thing to be "using it to reconstruct the whole book" rather than "distributing the compendium".

      And then Anthropic fended off the argument that their service was intended for doing the former because they were explicitly taking measures to prevent that.

      2 replies →

  • You can also, in the US, use it for any purposes which fall within the domain of "fair use", which while now also incorporated in the copyright statute, was first identified as an application of the first amendment and, as such, a constitutional limit on what Congress even had the power to prohibit with copyright law (the odd parameters of the statutory exception are largely because it attempted to codify the existing Constitutional case law.)

    Purposes which are fair use are very often not at all personal.

    (Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)

  • Hey can I have a fake llm "trained" on a set of copyrighted works to ask what those works are?

    So it totally isn't a warez streaming media server but AI?

    I'm guessing since my net worth isn't a billion plus, the answer is no

    • People have been coming up with convoluted piracy loopholes since the invention of copyright.

      If you xor some data with random numbers, both the result and the random numbers are indistinguishably random, and there is no way to tell which one came out of a random number generator and which one is "derived" from a copyrighted work. But if you xor them together again, the copyrighted work comes out. So if you have Alice distribute one of the random-looking things and Bob distribute the other one, and then Carol downloads them both and reconstructs the copyrighted work, have you created a scheme to copy whatever you want with no infringement occurring? (A sketch of this is at the end of this comment.)

      Of course not, at least Carol is reproducing an infringing work, and then there are going to be claims of contributory infringement etc. for the others if the scheme has no other purpose than to do this.

      Meanwhile this problem is also boring because preventing anyone from being the source of infringing works isn't a thing anybody has been able to do since at least as long as the internet has allowed anyone to set up a server in another jurisdiction.
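      A minimal sketch of the xor scheme described above, purely for illustration (the "book" bytes are a placeholder):

      ```python
      # Sketch of the xor "split" scheme: each share alone is indistinguishable
      # from random bytes, but xoring the two shares restores the original work.
      import secrets

      def split(work: bytes) -> tuple:
          pad = secrets.token_bytes(len(work))             # Alice's share: pure randomness
          share = bytes(a ^ b for a, b in zip(work, pad))  # Bob's share: work XOR pad
          return pad, share

      def reconstruct(pad: bytes, share: bytes) -> bytes:
          # Carol xors the two shares back together and gets the original work.
          return bytes(a ^ b for a, b in zip(pad, share))

      book = b"One Ring to rule them all..."  # placeholder for a copyrighted work
      alice_share, bob_share = split(book)
      assert reconstruct(alice_share, bob_share) == book
      ```

      The technical trick works, which is exactly why the analysis above focuses on who ends up reproducing the work and on contributory infringement, not on whether any single distributed file "looks" copyrighted.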