Comment by 3PS

7 months ago

Broadly summarizing.

This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.

This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.

Overall seems like a pretty reasonable ruling?

But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine. I guess I fail to see how it's any different from me using it in some other way? If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling boils down to "it's OK for a corporation to use lots of works without permission, but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone, I might feel differently.

"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)

  • > If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

    I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by them (though in that case it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate them either, so they are likely to be liable for that. (Exactly how that interpretation works with copyright law I'm not entirely sure: I know that in some places downloading stuff is less of a problem than distributing it to others, because the latter is the main thing copyright is concerned with. And AFAIK most companies doing large-model training maintain that fair use also extends to gathering the data in the first place.)

    (Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK; there's a complicated body of case law that forms the guidelines for it.)

    • I think the issue is that it's actually quite difficult to "unlearn" something once you've seen it. I'm speaking more about human learning than AI learning, but since AI design is inspired by nature, it will have similar qualities. If I see something that inspires me, regardless of whether I paid for it, I may not even know what specifically inspired me. If I sit on a park bench and an idea comes to me, it could come from a number of things: the bench, the park, the weather, what movie I watched last night, stuff on the wall of a restaurant while I was eating there, etc.

      While humans don't have encyclopedic memories, our brains connect a few dots to make a thought. If I say "Luke, I am your father", it doesn't matter that that isn't even the actual line; anyone who's seen Star Wars knows what I'm quoting. I may not be profiting from using that line, but that doesn't stop Star Wars from inspiring other elements of my life.

      I do agree that copyright law is complicated and AI is going to create even more complexity as we navigate this growth. I don't have a solution on that front, just a recognition that AI is doing what humans do, only more precisely.

    • AFAIK (IANAL), copyright and exhaustion rights are completely different. Under copyright, once a book is purchased, that's it. Reselling the same, or a transformed (e.g. highlighted), used copy is 100% legal, as is consuming it at your discretion (in your mind {a billion times}, in a fire, or (yes, even) in what amounts to a fancy calculator).

      (that's all to say copyright is dated and needs an overhaul)

      But that's taking the viewpoint of 'training a personal AI in your home', which isn't something that actually happens... The issue has never been the training data itself. Training an AI and 'looking at data and optimizing an understanding function (human or AI) over it' are categorically the same, even if mechanically/biologically they are very different.

  • > I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling boils down to "it's OK for a corporation to use lots of works without permission, but not for an individual to use a single work without permission."

    That's not what the ruling says.

    It says that training a generative AI system on one or more works is fair use, so long as the system isn't designed primarily as a direct replacement for those works, and that print-to-digital destructive scanning for storage and searchability is also fair use.

    These are both independent of whether one person, a giant company, or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works, since a gen-AI tool trained on exactly one work is probably somewhat less likely to have a real use beyond being a replacement for that work).

  • But if you did pirate the book, and let's say it cost $50, and then you used it to write a play based on that book and made $1 million selling that, only the $50 loss to the publisher would be relevant to the lawsuit. The fact that you wrote a non-infringing play based on it and made $1 million would be irrelevant to the case. The publisher would have no claim to it.

  • The judge actually agreed with your first paragraph:

    > This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

    (But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)

  • The analogy to training is not writing a play based on the work. It's more like reading (experiencing) the work and forming memories in your brain, which you can access later.

    I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.

    • This is nonsense, in my opinion. You aren't "hearing" anything. You are literally creating a work (in this case, the model) derived from another work.

      People need to stop anthropomorphizing neural networks. It's software, and software is a tool, and a tool is used by a human.

  • If I buy a book and use it to prop up the table on which I build a door, I don't owe the author any additional money over what I paid for it.

    If I buy a book, then as long as the product the book teaches me to build isn't a competing book, the original author should have no avenue for complaint.

    People are really getting hung up on the computer reading the data and computing other data with it. It shouldn't even need to get to fair use. It's so obviously none of the author's business well before fair use.

  • > But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine.

    Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone else’s intellectual work. The entire thing is based on economic pragmatism: just copying obviously does not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how their work can be used.

    Coming to the conclusion that you can “fair use” yourself out of paying for the most critical part of your supply chain makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos become the norm and various forms of DRM grow even more draconian.

    Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?

      > Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone else’s intellectual work.

      This makes no sense. If I buy and read a book on software engineering, and then use that knowledge to start a career, do I owe the author a percentage of my lifetime earnings?

      Of course not. And yet I've made money with the help of someone else's intellectual work.

      Copyright is actually pretty narrowly defined for _very good reason_.

Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

In this case, if you hired a bunch of artists/writers who somehow had never seen a Disney movie, and to train them to make crappy Disney clones you made them watch all the movies, it would certainly be legal to do so, but only if they had legit copies in the training room. Pirating the movies would be illegal.

Though the downside is that it does create a training moat. If you want to create the super-brain AI that's conversant with the corpus of copyrighted human literature, you're going to need a training library worth millions.

  • > Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

    Human time is inherently valuable; computer time is not.

    The issue with LLMs is that they allow doing things at a massive scale which would previously have been prohibitively time-consuming. (You could argue about that, but then: how much electricity is worth one human life?)

    If I "write" a book by taking another and replacing every word with a synonym (a toy sketch of this follows at the end of this comment), that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work, but as you make it less detectable, the time and effort required grows until it becomes uneconomical. An LLM can do it cheaply. It can mix and match parts of many works, but it's all still a derivative of those works combined. After all, if it weren't, it would produce equally good output with a tiny fraction of the training data.

    The outcome is that a small group of people (those making LLMs and selling access to their output) gets to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet), without compensating the larger group.

    That is fundamentally exploitative, whether the current laws accounted for that situation or not.
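
    To make the "replace every word with a synonym" idea concrete, here's a toy, hypothetical sketch in Python (the SYNONYMS table is a stand-in for a real thesaurus lookup; this illustrates the mechanism, not anyone's actual pipeline):

        import re

        # Hypothetical toy thesaurus; a real rewrite pipeline would use a full one.
        SYNONYMS = {"big": "large", "quick": "fast", "said": "stated"}

        def naive_rewrite(text):
            # Swap each word for a known synonym, leaving everything else untouched.
            def swap(match):
                word = match.group(0)
                return SYNONYMS.get(word.lower(), word)
            return re.sub(r"[A-Za-z']+", swap, text)

        print(naive_rewrite("The quick fox said something big"))
        # -> "The fast fox stated something large"

    Each further step away from the original (reordering words, rewording whole paragraphs) takes more effort to build and run, which is exactly the economics point above.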

  • That's part of the issue. I'm not sure if this has happened in the visual arts, but there is in fact precedent against hiring a sound-alike instead of the person you want to sound like. You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett". It's pretty clear at that point what you wanted, but you didn't want to pay talent for it.

    I see elements of that here: buying copyrighted works not to be exposed to and inspired by them, nor to utilize the authors' talents, but to fuel a commercialization of sound-alikes.

    • > You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett"

      Keep in mind, the Authors in the lawsuit are not claiming the _output_ is copyright infringement, so Alsup isn't deciding that.

    • > but there is in fact precedent against hiring a sound-alike instead of the person you want to sound like. You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett". It's pretty clear at that point what you wanted, but you didn't want to pay talent for it.

      You're referencing Midler v. Ford Motor Co. in the 9th Circuit. That case largely applies to California, not the whole nation. Even then, it would take only one Supreme Court case to overturn it.

  • > Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

    How many copies? They're not serving a single client.

    Libraries need to have multiple e-book licenses, after all.

    • In the human-training case, a store-bought DVD would probably still run afoul of that licensing issue. That's a broader topic of audience, and I didn't want to muddy the analogy with that detail.

      It changes the definition of what a "legal copy" is but the general idea that the copy must be legal still stands.

  • What you are describing happened and they got sued:

    https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...

    I'm on the Air Pirates side for the case linked, by the way.

    However, AI is not a parody. It's not adding to the cultural expression like a parody would.

    Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:

    Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.

    • > It's not adding to the cultural expression like a parody would.

      Says who?

      > Is AI contributing to education and/or culture _right now_, or is it trying to make money?

      How on earth are those things mutually exclusive? Also, whether or not it's being used to make money is completely irrelevant to whether or not it is copyright infringement.

Agreed. If I memorize a book and am deployed into the world to talk about what I memorized, that is not a violation of copyright. Which is logically reasonable, because essentially this is what an LLM is doing.

  • It might be different if you are a commercial product which couldn’t have been created without incorporating the contents of all those books.

    Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.

    • But a commercial product is reaching parity with human capability.

      Let's be real: humans have special treatment (more special than animals, as we can eat and slaughter animals but not other humans) because WE created the law to serve humans.

      So in terms of being fair across the board, LLMs are no different. But there's no harm in giving ourselves special treatment.

  • Except you can't do it at a massive scale. LLMs both memorize at a scale bigger than thousands, probably millions, of humans AND reproduce at an essentially unlimited scale.

    And who gets the money? Not the original author.

  • You can talk about it, but you can't sell tickets to an event where you recite from memory all the poems written by someone else without their permission.

    LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case: it's not the main value proposition that drives LLM company revenues, it's not the use case that's marketed, and it's not the use case that people use them for in real life.

Depends on whether you actually agree it's transformative.

  • For textual purposes it seems fairly transformative.

    If you train an LLM on Harry Potter and ask it to generate a story that isn't Harry Potter, then it's not a replacement.

    However, if you train a model on stock imagery and use it to generate stock imagery, then I think you'll run into an issue from the Warhol case.

  • What's the steelman case that it is transformative? Because prima facie, it seems to output only original, "intelligent" output.

If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?

  • You don't need a license for most of what people do with traditional, physical copyrighted copies of works: reading them, playing a DVD at home, etc. Those things are outside the scope of copyright. But you do need a license to make copies, and ebooks generally come with licensing agreements, because to read an ebook you must first make a brand-new copy of it. As a result, physical books just don't have "licenses" to begin with, and if publishers tried to impose them, they'd be unenforceable, since you don't need to "agree" to any "terms" to read a book.

  • > If a publisher adds a "no AI training" clause to their contracts?

    This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding precedent (trial court rulings aren't), such clauses would be as valid afterward as they are today. But contracts only affect people who are parties to the contract.

    Also, the damages calculations for breach of contract are different from those for copyright infringement: infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike infringer's profits in copyright, it requires showing special circumstances).

  • Fair Use and similar protections are there to protect the end user from predatory IP holders.

    First, I don't think publishers of physical books in the US get the right to establish a contract. The book can be resold, for instance, and that right cannot be diminished. Second, adding more cruft to the distribution of something that the end user has a right to transform isn't going to diminish that right.

  • Fair use overrides licensing

    • Fair use "overrides" licensing in the sense that one doesn't need a copyright license if fair use applies. But fair use itself isn't a shield against breach of contract. If you sign a license contract saying you won't train on the thing you've licensed, the licensor still has remedies for breach of contract, just not remedies for copyright infringement (assuming the act is fair use).

  • What contract? With whom?

    Meta at least just downloaded ENGLISH_LANGUAGUE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.

    • I know, but the article mentions that a separate ruling will be made about that pirating.

      quote: “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”

      This tells me Anthropic acquired these books legally afterwards. I was asking whether, during that purchase, the seller could add a no-training clause to the sales contract.

It’s similar to the Google Books ruling, which Google lost. Anthropic also lost. TechCrunch and others are very aspirational here.

  • Do you mean Authors Guild, Inc. v. Google, Inc.? Google won that case:

    https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

    Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.

      See, but if you ask a copyright attorney: Google lost. This is what I mean by aspirational. They won something, "fair use," in circumstances very similar to Anthropic's, but everything else that made what they were doing a practical reality instead of purely theoretical required negotiation with the Authors Guild, and indeed, they are not doing what they wanted to do, right? Anthropic still has to go to trial, they had to pirate the books to train, and they will not win on their right to commercialize the results of training, because neither did Google. So what good is the fair use ruling, besides allowing OpenAI v. NYTimes to proceed a little longer?

What if I overfit my LLM so that it spits out copyrighted work with special prompting? Where do we draw the line in training?

  • If you do something else, the result may be something else. The line is drawn by the application of subjective common sense by the judge, just as it is every time.

  • I mean, the human brain can memorize things as well, and that's not illegal. It's only illegal if the memorized thing is distributed.

    • Humans don't scale. LLMs do.

      Even if LLMs were actual human-level AI (they are not - by far), a small bunch of rich people could use them to make enormous amounts of money without putting in the enormous amounts of work humans would have to.

      All the while they'd be "training" (= precomputing transformations which, among other things, make plagiarism detection difficult) on work that took enormous amounts of human labor, without compensating those workers.

    • Humans can only memorize so few texts in comparison, so they're not scalable in the same sense LLMs are.

BRB, I'm going to download all the TV shows and movies to train my vision model. Just to be sure it's working properly, I have to watch some for debugging purposes.