Comment by runnig
11 hours ago
I'll just leave it here: "Anthropic's downloading of over seven million books from pirate sites like LibGen constituted infringement, the judge ruled, rejecting Anthropic's 'research purpose' defense: 'You can't just bless yourself by saying I have a research purpose and, therefore, go and take any textbook you want.'"
https://www.joneswalker.com/en/insights/blogs/ai-law-blog/wh...
Don't you find it funny that when you ask for song lyrics these models suddenly remember copyrighted material?
Some do, others decline to answer.
In the early days of music streaming, many of the entrants were seeding their service with vast libraries of pirated content. The winners cut deals with the copyright holders and then went after the rest.
Or take the early days of video uploads: YouTube's most-watched videos were "pirated" clips from popular shows (e.g. SpongeBob, The Daily Show), which was part of the reason I went to YouTube instead of other video hosting sites (e.g. DailyMotion).
Viacom sued YouTube, while CBS and Universal ended up licensing their content.
https://www.eff.org/deeplinks/2007/03/viacom-v-google-invest...
They still are. My kids haven't watched a single Simpsons or Family Guy episode but are quoting both regularly.
Facebook et al. also quite literally stole email contact lists and, via the phone manufacturers, installed spyware at kernel level on mobile phones, which they used to spy on all Android users.
Yet they did not need to destroy the models which were trained with them?
Using them was allowed as fair use – it was the downloading of the pirated copies that was infringement. That's why Anthropic switched to scanning paper books.
> That's why Anthropic switched to scanning paper books.
After they threw away all the tainted data from the pirated books, right?
> Using them was allowed as fair use
That is only relevant in the US, and even there it is still not clear-cut whether the fair use doctrine applies in all these scenarios. Outside of the US the situation is also quite different: for example, take a look at the recent ruling on GEMA vs OpenAI in Germany.
The reality is that the copyright issue with generative AI is very complex and reaching anything resembling a conclusion will take much more than a few opinion paragraphs from an American district judge.
Isn't scanning also a form of copyright infringement? You are making a digital copy of a book, which is the same thing as downloading a book from the internet...
> That's why Anthropic switched to scanning paper books.
Could they not just subscribe to the academic publishers like universities do? Or buy eBooks? I don't understand how the "scanning" part is relevant here other than used physical books being cheaper perhaps?
If using the books is fair use, then distilling the model, which is just a derived product of those books, is also fair use.
These companies are trying to have their cake and eat it too.
In a different world it is not fair use. The benefits of the crime should always be taken away. If you isolate the training from the pirating, you may say that it was fair, but that completely misses the point. The sole purpose of the pirating (aka the crime) was to train the models.
Should we require the destruction of the brains of those that watch pirated movies?
Different situations call for different responses.
When someone steals a watch, we force them to give it back. Yet when someone steals a cake and eats it, we don't force them to puke it back up.
If you pirate a movie, the court might very well force you to delete all the copies you made of the movie you downloaded, destroy DVDs you burned, etc.
Well I enjoyed this response.
Have we already agreed that AI is equal to human life and not a machine?
"You're trying to kidnap what I've rightfully stolen!"
How many “capabilities” did they “extract” from those books?
The capabilities of the books' writers to produce the text contained within them, which is exactly what Alibaba "extracted" from Claude. The point here is that Anthropic's framing as some sort of sophisticated technological attack is the ridiculous part. It's writing prompts and saving responses. We're all running "distillation attacks" on Claude, every day! Most of us just don't feed that stuff into a training corpus.
Exactly. Couldn't happen to better people. I'm pretty against piracy personally but if we find reliable ways to pirate Anthropic/OpenAI products in the future I'm all for it.