Comment by loudmax
3 months ago
Simon Willison had an analysis of Claude's system prompt back in May. One of the things that stood out was the effort they put in to avoiding copyright infringement: https://simonwillison.net/2025/May/25/claude-4-system-prompt...
Everyone knows that these LLMs were trained on copyrighted material, and as a next-token prediction model, LLMs are strongly inclined to reproduce text they were trained on.
All AI companies know they're breaking the law. They all have prompts that effectively say "Don't show that we broke the law!". That tech companies consistently break the law and nothing happens to them is an indictment of our current economy.
And it's a question of whether we accept breaking the law for the possibility of having the greatest technological advancement of the 21st century. In my opinion, the legal system has become a blocker for a lot of innovation, not only in AI but elsewhere as well.
This is a point that I don't see discussed enough. I think Anthropic decided to purchase books in bulk, tear them apart to scan them, and then destroy those copies. And that's the only source of copyrighted material I've ever heard of that is actually legal to use for training LLMs.
Most LLMs were trained on vast troves of pirated copyrighted material. Folks point this out, but they don't ever talk about what the alternative was. The content industries, like music, movies, and books, have done nothing to research or make their works available for analysis and innovation, and have in fact fought industries that seek to do so tooth and nail.
Further, they push the narrative that people who pirate works are stealing from the artists, when in fact the vast majority of the money a customer pays for a piece of copyrighted content goes to the publishing industry. This is essentially the definition of rent seeking.
Those industries essentially tried to stop innovation entirely, and they tried to use the law to do that (and still do). So, other companies innovated over the copyright holder's objections, and now we have to sort it out in the courts.
You’re willing to eliminate the entire concept of intellectual property for a possibility something might be a technological advancement? If creators are the reason you believe this advancement can be achieved, are you willing to provide them the majority of the profits?
Without agreeing or disagreeing with your view, I feel like the issue with that paradigm is inconsistency. If an individual "pirates", they get fines and possible jail time, but if a large enough company does it, they get rewarded by stockholders and at most a slap on the wrist from regulators. If as a society we've decided that the restrictions aren't beneficial, they should be lifted for everyone, not just ignored when convenient for large corporations. As it stands right now, the punishments scale inversely with the amount of damage the lawbreaker is actually capable of doing.
> And it's a question of do we accept breaking law for the possibility to have the greatest technological advancement of the 21st century
You mean like, murder?
The whole industry is based on breaking the law. You don't get to be Microsoft, Google, Amazon, Meta, etc. without large amounts of illegality.
And the VC ecosystem and valuations are built around this assumption.
I don’t read this as “don’t show we broke the law,” I read it as “don’t give the user the false impression that there’s any legal issue with this generated content.”
There’s nothing law breaking about quoting publicly available information. Google isn’t breaking the law when it displays previews of indexed content returned by the search algorithm, and that’s clearly the approach being taken here.
Masked token prediction is reconstruction. It goes far beyond “quoting.”
This is incorrect. Two judges have now ruled that training on copyrighted data is fair use. https://www.whitecase.com/insight-alert/two-california-distr...
Training on copyright is not illegal. Even in the lawsuit against anthropic it was found to be fair use.
Pirating material is a violation of copyright, which some labs have done, but that has nothing to do with training AI and everything to do with piracy.
If my for profit/for sale product couldn't exist without inputting copyrighted works into it, then my product is derivative of those works. It's a pretty simple concept. No 'but human brains learn'. Humans aren't a corpo's for profit product.
'Would this product have the same value without the copyrighted works?'
If yes then it's not derivative. If no then it is.
There is US precedent for training being deemed not fair use. https://www.dglaw.com/court-rules-ai-training-on-copyrighted...
Why wouldn’t training be illegal? It’s illegal for me to acquire and watch movies or listen to songs without paying for them*. If consuming copyrighted material isn’t fair use, then it doesn’t make sense that AI training would be fair use.
* I hope it’s obvious but I feel compelled to qualify that, of course, I’m talking about downloading (for example torrenting) media, and not about borrowing from the library or being gifted a DVD, CD, book or whatever, and not listening/watching one time with friends. People have been successfully prosecuted for consuming copyrighted material, and that’s what I’m referring to.
> Training on copyright is not illegal.
The court decision this thread is about holds that it is, on the grounds that the training data was copied to the LLM's memory.
You can always vote, but there is always someone going through the back door paying politicians and judges.
and training on mountains of open source code with no attribution is exactly the same
the code models should also be banned, and all output they've generated subject to copyright infringement lawsuits
the sloppers (OpenAI, etc) may get away with it in the US, but the developed world has far more stringent copyright laws
and the countries that have massive industries based on copyright aren't about to let them evaporate for the benefit of a handful of US tech-bros
No thank you. I am perfectly fine with AI training on my open source code and it is perfectly legal because my open source code does not include a license that bans AI training.
Post-trained models are strongly inclined to produce responses similar to those that earned them a high RL score. It's slightly wrong to keep thinking of LLMs as just next-token prediction from the dataset's probability distribution, as if they were some Markov chain.
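To make the contrast concrete, here's a minimal sketch of what a pure Markov-chain next-token predictor actually is: it can only replay n-gram statistics observed in its training text. The tiny corpus and function names are invented for illustration; real LLMs learn dense representations and are further shaped by RL post-training, which is exactly why the Markov-chain mental model falls short.

```python
import random
from collections import defaultdict

# Toy bigram Markov chain: the next token is sampled purely from
# the frequencies of successors seen in the training corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Map each word to the list of words observed to follow it.
successors = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev].append(nxt)

def sample_next(word, rng):
    """Sample a next token from observed bigram frequencies, or None."""
    options = successors.get(word)
    if not options:
        return None
    return rng.choice(options)

rng = random.Random(0)
out = ["the"]
for _ in range(5):
    nxt = sample_next(out[-1], rng)
    if nxt is None:
        break
    out.append(nxt)
print(" ".join(out))
```

Such a model can only ever emit continuations it has literally seen; an RL-tuned LLM instead shifts probability mass toward whole responses that scored well during post-training, which is a different objective than matching the training distribution.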