Comment by nicce

11 hours ago

Yet they did not need to destroy the models which were trained with them?


nicce

Using them was allowed as fair use – it was the downloading of the pirated copies that was infringement. That's why Anthropic switched to scanning paper books.

maccard 11 hours ago
> That's why Anthropic switched to scanning paper books.
After they threw away all the tainted data from the pirated books, right?
- ascorbic 10 hours ago
  
  No, because the judge ruled that the training was fair use and the model itself wasn't infringing.
  
- tidojo 10 hours ago
  
  Yes, as part of the settlement
  
pera 9 hours ago

> Using them was allowed as fair use
That is only relevant in the US, and even there it is still not clear-cut whether the fair use doctrine applies in all these scenarios. Outside of the US the situation is also quite different: for example, take a look at the recent ruling on GEMA vs OpenAI in Germany.
The reality is that the copyright issue with generative AI is very complex and reaching anything resembling a conclusion will take much more than a few opinion paragraphs from an American district judge.
kykeonaut 10 hours ago
Isn't scanning also a form of copyright infringement? You are making a digital copy of a book, which is the same thing as downloading a book from the internet...
- pmontra 10 hours ago
  
  I think that we can run a perhaps silly thought experiment.
  Suppose that I have a nearly perfect memory and I could remember all the books I read. Suppose also that I have a million-year life span so I could read 7 million books. Then, what happens if at the end of all of those years, or at any earlier moment, I answer questions from people and commercially exploit the knowledge I gathered reading those books? Would my reading those books be study or copyright infringement? Remember the nearly perfect memory hypothesis.
  Of course it's a bit silly, because the time to train an LLM and the time I need to read all those books differ by orders of magnitude, and that changes the perspective. Who would complain to me today if their heirs lose some money in 7 million AD? Who would even notice that I started that million-year-long endeavor? Who's going to be there to ask me questions by then? Humans? Birds? Lizards? And I can say that I am studying like everybody else before me, but does an LLM study? And I am sure there are many other nuances.
  Anyway, I don't think that scanning is any different from photons hitting my retina. The difference is in what happens next: the faithfulness of memory, the amount of knowledge, the speed of accumulating it. After all, a large enough quantity can become quality.
  
- reedciccio 9 hours ago
  
  No, there is a famous law case to prove that's allowed: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
- maxloh 8 hours ago
  
  Copyright protects the presentation of knowledge, not the knowledge itself, which is uncopyrightable in almost all jurisdictions.
  As long as the book was a legal copy, that is allowed legally.
- monegator 10 hours ago
  
  Here we have a 15% limit on scanning for fair use
- shakna 9 hours ago
  
  As long as it is destructive, and the digital copy is access-restricted to equal the licenses or physical copies destroyed, then it falls under fair use.
- yonatan8070 10 hours ago
  
  I'm pretty sure every book I've seen has a page that says you're not allowed to copy/scan/photograph it.
  
olalonde 10 hours ago
> That's why Anthropic switched to scanning paper books.
Could they not just subscribe to the academic publishers like universities do? Or buy eBooks? I don't understand how the "scanning" part is relevant here other than used physical books being cheaper perhaps?
- ascorbic 10 hours ago
  
  Bulk second-hand books are a lot cheaper than ebooks. Also not all books are available as ebooks, and ebooks have terms of service that presumably prevent them being used for training.
realusername 11 hours ago
If using the books is fair use, then distilling the model, which is just a derived product of those books, is also fair use.
These companies are trying to have their cake and eat it too.
- drdaeman 10 hours ago
  
  Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).
  By contrast, training on a model's behavior purportedly replicates that behavior, at least approximately. It gets replicated intentionally, as a whole.
  IANAL, but I see a significant difference: an intent to copy a significant part as a whole into a competing product surely shouldn’t fit under the legal concept of fair use, no matter whether scanning books for LLM training fits or not.
  Whether such things (behaviors) are copyrightable, and whether they should be, is another interesting question. Those aren’t algorithms or databases (stuff clearly and explicitly covered in many copyright laws); those are human expectation models, something like how we train animals or teach our own.
  
- ascorbic 10 hours ago
  
  Probably, yes. It's likely just a breach of their terms of service. You'll note that they're not suing them – they're trying to get the government to do their work for them.
nicce 11 hours ago
In a different world it is not fair use. The benefits of a crime should always be taken away. If you separate the training from the pirating, you may say that it was fair, but that completely misses the point. The sole purpose of the pirating (aka the crime) was to train the models.
- ascorbic 10 hours ago
  
  Copyright infringement isn't usually a crime.
  

zaptrem 11 hours ago

Should we require the destruction of the brains of those that watch pirated movies?

hmry 11 hours ago
Different situations call for different responses.
When someone steals a watch, we force them to give it back. Yet when someone steals a cake and eats it, we don't force them to puke it back up.
If you pirate a movie, the court might very well force you to delete all the copies you made of the movie you downloaded, destroy DVDs you burned, etc.
- raverbashing 11 hours ago
  
  Thanks for proving current copyright law makes no sense.
  Here's a better idea: a fixed fee for any work. You can buy the license to read a book for $X (for whatever purpose) on RAND terms. Of course publisher/material costs go on top, so if you're buying an actual book you're paying for the material costs as well, or streaming fees or whatever.
  
TightFibre 10 hours ago

Well I enjoyed this response.
nicce 6 hours ago

Have we already agreed that AI is equal to human life and not a machine?