> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
You skipped quotes about the other important side:
> But Alsup drew a firm line when it came to piracy.
> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."
That is, he ruled that
- buying, physically cutting up, physically digitizing books, and using them for training is fair use
- pirating the books for their digital library is not fair use.
As they mentioned, the piracy part is obvious. It's the fair use part that will set an important precedent for being able to train on copyrighted works as long as you have legally acquired a copy.
> - buying, physically cutting up, physically digitizing books, and using them for training is fair use
> - pirating the books for their digital library is not fair use.
Those two seem inconsistent with one another. If it's fair use, how is it piracy?
It also seems pragmatically trash. It doesn't do the authors any good for the AI company to buy one copy of their book (and a used one at that), but it does make it much harder for smaller companies to compete with megacorps for AI stuff, so it's basically the stupidest of the plausible outcomes.
> pirating the books for their digital library is not fair use.
"Pirating" is a fuzzy word and has no real legal meaning. Specifically, I think this is the crux:
> without adding new copies, creating new works, or redistributing existing copies
Essentially: downloading is fine, sharing/uploading is not. Which makes sense. The assertion here is that Anthropic (per this line) did not distribute the files they downloaded.
Aaron Swartz wanted to provide the public with open access to paywalled journal articles, while Anthropic wants to use other people's copyrighted material to train its own private models that it restricts access to via a paywall. It's wild (but unsurprising) that Aaron Swartz was prosecuted under the CFAA for this while Anthropic is allowed to become commercially successful.
I'm not sure how I feel about what Anthropic did on the merits, as a matter of scale, but from a legalistic standpoint, how is it different from using the book to train the meat model in my head? I could even learn bits by heart and quote them in context.
Not sure about the law, but if you memorize and quote bits of a book and fail to attribute them, you could be accused of plagiarism. If, for example, you were a journalist or researcher, this could have professional consequences. Anthropic is building tools to do the same at immense scale with no concept of what plagiarism or attribution even is, let alone any method to track sourcing, and they're still willing to sell these tools. So even if your meat model and the trained model do something similar, you have a notably different understanding of what you're doing. Responsibility might ultimately fall to the end user, but it seems like something is getting laundered here.
That's only really applicable to evidence in criminal cases obtained by the government. No such doctrine exists for civil cases, for instance. It doesn't even bar the government from using evidence that others have collected illegally of their own volition.
> Here is how individuals are treated for massive copyright infringement:
When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.
This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.
I wouldn't be so sure about that statement; no one has ruled on the output of Anthropic's AI yet. If their AI spits out an original copy of the book, then it is practically the same as buying the book from them instead of the copyright holder.
We've only dealt with the fairly straightforward legal questions so far. This legal battle is still far from settled.
Anthropic isn’t selling copies of the material to its users though. I would think you couldn’t lock someone up for reading a book and summarizing or reciting portions of the contents.
Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.
> summarizing or reciting portions of the contents
This absolutely falls under copyright law as I understand it (not a lawyer). E.g. the disclaimer that rolls before every NFL broadcast. The notice states that the broadcast is copyrighted and any unauthorized use, including pictures, descriptions, or accounts of the game, is prohibited. There is wiggle room for fair use by news organizations, critics, artists, etc.
I'm wondering though how the law will construe AI able to make a believable sequel to Moby Dick after digesting Herman Melville's works. (Or replace Melville with a modern writer.)
Except they aren’t merely reading and reciting content, are they? That’s a rather disingenuous argument to make. All these AI companies are high on billions in investment and think they can run roughshod over all rules in the sprint towards monetizing their services.
Make no mistake, they’re seeking to exploit the contents of that material for profits that are orders of magnitude larger than what any shady pirated-material reseller would make. The world looks the other way because these companies are “visionary” and “transformational.”
Maybe they are, and maybe they should even have a right to these buried works, but what gives them the right to rip up the rule book and (in all likelihood) suffer no repercussions in an act tantamount to grand theft?
There’s certainly an argument to be had about whether this form of research and training is a moral good and beneficial to society. My first impression is that the companies are too opaque in how they use and retain these files, albeit for some legitimate reasons, but nevertheless the archival achievements are hidden from the public, so all that’s left is profit for the company on the backs of all these other authors.
What point are you making? 20 years ago, someone sold pirated copies of software (where's the transformation there?), and that's the same as using books in a training set? The judge already said reading isn't infringement.
Aren't you comparing the wrong things? First example is about the output/outcome, what is the equivalent for LLMs? Also, not all "pirated" things are sold, most are in fact distributed for free.
"Pirates" also transform the works they distribute. They crack it, translate it, compress it to decrease download times, remove unnecessary things, make it easier to download by splitting it into chunks (essential with dial-up, less so nowadays), change distribution formats, offer it through different channels, and bundle extra software and media that they themselves might have coded, like trainers, installers, sick chiptunes, and so on. Why is the "transformation" done by a big corpo more legal in your view?
Can you explain why? What makes them categorically different or at the very least why is "piracy" quantitatively worse than 'just' copyright violation?
Apparently it's a common business practice. Spotify (even though I can't find any proof) seems to have built its software and business on pirated music. There is some more in this article [0].
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
Crunchyroll was originally an anime piracy site that went legit and started actually licensing content later. They started in mid-2006, got VC funding in 2008, then made their first licensing deal in 2009.
And now Crunchyroll is owned by (through a lot of companies, like Aniplex of America, Aniplex, A1 Pictures) Sony, who produces a large amount of anime!
not just Spotify, pretty much any (most?) current tech giant was built by
- riding a wave of change
- not caring too much about legal constraints (or, as they would say now, "disrupting" the market, which very often means doing illegal stuff that brings them far more money than any penalties they will ever face for it)
- not caring too much about ethics either
- and, in recent years (starting with Amazon), a lot of technically illegal financing (undercutting competitors' prices long-term with money from elsewhere, e.g. investors, is an unfair competitive advantage that is (theoretically) clearly not allowed by anti-monopoly laws; and before that you often still had other monopoly issues, e.g. see Wintel)
So yes, systematically not complying with the law to gain an unfair competitive advantage, knowing that many of these laws are in the larger picture toothless when applied to huge companies, is bread-and-butter work for US tech giants
As you point out, they mostly did this before they were large companies (where the public choice questions are less problematic). Seems like the breaking of these laws was good for everybody.
"recording obtained unofficially" and "doesn't have rights to the recording" are separate things. So they could well have got a license to stream a publisher's music but that didn't come with an actual copy of some/all of the music.
The problem is that these "small things" are not necessarily small if you're an individual.
If you're an individual pirating software or media, then from the rights owners' perspective, the most rational thing to do is to make an example of you. It doesn't happen everyday, but it does happen and it can destroy lives.
If you're a corporation doing the same, the calculation is different. If you're small but growing, future revenues are worth more than the money that can be extracted out of you right now, so you might get a legal nastygram with an offer of a reasonable payment to bring you into compliance. And if you're already big enough to be scary, litigation might be just too expensive to the other side even if you answer the letter with "lol, get lost".
Even in the worst case - if Anthropic loses and the company is fined or even shuttered (unlikely) - the people who participated in it are not going to be personally liable and they've in all likelihood already profited immensely.
It's not a common business practice. That's why it's considered newsworthy.
People on the internet have forgotten that the news doesn't report everyday, normal, common things, or it would be nothing but a listing of people mowing their lawns or applying for business loans. The reason something is in the news is because it is unusual or remarkable.
"I saw it online, so it must happen all the time" is a dopey lack of logic that infects society.
Google Music originally let people upload their own digital music files. The argument at the time was that whether or not the files were legally obtained was not Google’s problem. I believe Amazon had a similar service.
This isn't as meaningful as it sounds. Nintendo was apparently using scene roms for one of the official emulators on Wii (I think?). Spotify might have received legally-obtained mp3s from the record companies that were originally pulled from Napster or whatever, because the people who work for record companies are lazy hypocrites.
The NES Classic console. The ROMs had an iNES emulator header lol.
And the PlayStation Classic used an open-source PS1 emulator.
There was also some steam game ported from GameCube, and it had the Dolphin Emulator FPS counter in the corner of part of the trailer :D
I also remember reading that two of the PCSX2 devs ended up working on the Emotion Engine emulator for PS3 consoles with partial software emulation of the PS2 (the CECH-02 and later models, where they removed the Emotion Engine chip).
The common meme that megacorps are shamelessly criminal organizations that get away with doing anything they can to maximize profits, while true in some regard, totally pales in comparison to the illegal things small businesses and start-ups do.
YouTube's initial success came from being able to serve, on a global scale, user-uploaded, largely uncredited copyright violations of both video and audio.
Facebook's "pivot to video" similarly relied on user-uploaded unlicensed video content, now not just pulling from television and film, but from content creators on platforms like YouTube.
Today, every "social" platform is now littered with "no copyright infringement intended" and "all credit to the original" copy-and-paste junk. Don't get me wrong, I'm a fan of remix culture – but I believe appropriating and monetizing the work of others without sharing the reward is a destructive cycle. And while there are avenues for addressing this, they're designed for the likes of Universal, Sony, Disney, etc. (I've had original recordings of original music flagged by megacorps because the applause triggered ContentID.)
AI slop further poisons the well. It's rough going out there.
You are missing the point. Spotify had permission from the copyright holders and/or their national proxies to use those songs in a limited beta in Sweden. They didn't have access to clean audio data directly from the record companies, so in many cases they used pirated rips instead.
What you really should be asking is whether they infringed on the copyrights of the rippers. /s
They had a second company (whose name I don't remember) that allowed users to back up and share their music.
When they were exposed, they buried that as deep as they could.
I know this might come as a shock to those living in San Francisco, but things are different in other parts of the world, like Uruguay, Sweden, and the rest of Europe. From what I've read, the European Commission actually cares about enforcing the law.
Someone on Twitter said: "Oh well, P2P mp3 downloads, although illegal, made contributions to the music industry"
That's not what's happening here. People weren't downloading music illegally and reselling it on Claude.ai. And while P2P networks led to some great tech, there's no solid proof they actually improved the music industry.
Stealing with the intent to gain an unfair market advantage, so that you can effectively kill any ethically and legally compliant company, in a way that is very likely going to hurt many authors through the products you create, is far worse than just stealing for personal use.
> Stealing is stealing. Let's stop with the double standards.
I get the sentiment, but that statement, as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.
There are so many possible texts, and they're so sparse, that if I could copyright a work and never publish it, the restriction would be irrelevant. The probability that you would accidentally produce something close enough that copyright was relevant is almost infinitesimal.
Because of this copyright is an incredibly weak restriction, and that it is as weak as it is shows clearly that any use of a copyrighted work is due to the convenience that it is available.
That is, it's about making use of the work somebody else has done, not about that restricting you somehow.
Therefore copyright is much more legitimate than ordinary property. Ordinary property, especially ownership of land, can actually limit other people. But since copyright is so sparse infringing on it is like going to world with near-infinite space and picking the precise place where somebody has planted a field and deciding to harvest from that particular field.
Consequently I think copyright infringement might actually be worse than stealing.
oh well, the product has a cute name and will make someone a billionaire, let's just give it the green light. who cares about copyright in the age of AI?
Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?
Saying "they have the money" is not an argument. It's about the amount of effort needed to individually buy, scan, and process millions of pages. If that's already been done for you, why re-do it all?
The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I'm against Anthropic stealing teacher's work and discouraging them from ever writing again. Some teachers are already saying this (though probably not in California).
> The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I think this is a fantasy. My father cowrote a Springer book about physics. For the effort, he got like $400 and 6 author copies.
Now, you might say he got a bad deal (or the book was bad), but I don't think hundreds of thousands of authors do significantly better. The reality is, people overwhelmingly write because they want to, not because of money.
Training a generative model on a book is the mechanical equivalent of having a human read the book and learn from it. Is it stealing if a person reads the book and learns from it?
That will be sad, although there will still be plenty of great people who will write books anyway.
When it comes to a lot of these teachers, I'll say, copyright work hand in hand with college and school course book mandates. I've seen plenty of teachers making crazy money off students' backs due to these mandates.
A lot of the content taught in undergrad and school hasn't changed in decades or even centuries. I think we have all the books we'll ever need in certain subjects already, but copyright keeps enriching people who write new versions of these.
They won't be needed anymore, once singularity is reached. This might be their thought process. This also exemplifies that the loathed caste system found in India is indeed in place in western societies.
There is no equality, and seemingly there are worker bees who can be exploited, and there are privileged ones, and of course there are the queens.
> 150K per work is the maximum fine for willful infringement
No, it's not.
It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Of course, there's also a very wide range of statutory damages; the minimum (if it is not "innocent" infringement) is $750/work.
> 105B+ is more than Anthropic is worth on paper.
The actual amount of 7 million works times $150,000/work is $1.05 trillion, not $105 billion.
Even if they don't qualify for willful infringement damages (lets say they have a good faith belief their infringement was covered by fair use) the standard statutory damages for copyright infringement are $750-$30,000 per work.
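For a rough sense of scale, the per-work figures above multiply out like this (a back-of-the-envelope sketch; the 7 million work count comes from the thread, and none of these damages have actually been adjudicated):

```python
# Rough statutory-damages arithmetic (illustrative only; per-work figures
# are the statutory ranges cited in the thread, not an actual award).
works = 7_000_000
willful_max = 150_000                      # per-work cap for willful infringement
standard_min, standard_max = 750, 30_000   # ordinary statutory range per work

print(f"Willful max:  ${works * willful_max:,}")   # $1,050,000,000,000 (~$1.05 trillion)
print(f"Standard min: ${works * standard_min:,}")  # $5,250,000,000
print(f"Standard max: ${works * standard_max:,}")  # $210,000,000,000
```

So even at the bottom of the standard range, 7 million works would be billions of dollars, which is why the trillion-dollar headline number is the ceiling, not the expectation.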
Isn't "pirating" a felony with jail time, though? That's what I remember from the FBI warning I had to see at the beginning of every DVD I bought (but not "pirated" ones).
Just downloading them is of course cheaper, but it is worth pointing out that, as the article states, they did also buy legitimate copies of millions of books. (This includes all the books involved in the lawsuit.) Based on the judgement itself, Anthropic appears to train only on the books legitimately acquired. Used books are quite cheap, after all, and can be bought in bulk.
Buying a book is not license to re-sell that content for your own profit. I can't buy a copy of your book, make a million Xeroxes of it and sell those. The license you get when you buy a book is for a single use, not a license to do what ever you want with the contents of that book.
If you wanted to be legit with 0 chance of going to court, you would contact publisher and ask to pay a license to get access to their catalog for training, and negotiate from that point.
This is what every company using media does (think Spotify, Netflix, but also journals, ad agencies, ...). I don't know why people on HN are giving AI companies a pass for this kind of behavior.
> I don't know why people on HN are giving AI companies a pass for this kind of behavior.
As mentioned in The Fucking Article, there's a legal difference between training an AI that largely doesn't repeat things verbatim (à la Anthropic) and redistributing media wholesale (à la Spotify, Netflix, journals, ad agencies).
This is not about paying for a single copy. It would still be wrong even if they have bought every single one of those books.
It is a form of plagiarism. The model will use someone else's idea without proper attribution.
Legally speaking, we don't know that yet. Early signs are pointing at judges allowing this kind of crap because it's almost impossible for most authors to point out what part of the generated slop was originally theirs.
But should the purchase be like a personal license? Or like a commercia license that costs way more?
Because for example if you buy a movie on disc, that's a personal license and you can watch it yourself at home. But you can't like play it at a large public venue that sell tickets to watch it. You need a different and more expensive license to make money off the usage of the content in a larger capacity like that.
And the crazy thing is that might be cheaper when you consider the alternative is to have your lawyers negotiate with the lawyers for the publishing companies for the right to use the works as training data. Not only is it many many billable hours just to draw up the contract, but you can be sure that many companies would either not play ball or set extremely high rates. Finally, if the publishing companies did bring a suit against Anthropic they might be asked to prove each case of infringement, basically to show that a specific work was used in training, which might be difficult since you can't reverse a model to get the inputs. When you're a billion dollar company it's much easier to get the courts to take your side. This isn't like the music companies suing teenagers who had a Kazaa account.
These are the people shaping the future of AI? What happened to all the ethical values they love to preach about?
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
> We've held China accountable for counterfeiting products for decades and regulated their exports
We have? Are we from different multi-verses?
The one I've lived in to date has not done anything against Chinese counterfeits beyond occasionally seizing counterfeit goods during import. But that's merely occasionally enforcing local counterfeit law, a far cry from punishing the entity producing it.
As a matter of fact, companies started outsourcing everything to China, making further IP theft and quasi-copies even easier.
I was gonna say, the enforcement is so weak that it's not even really worth it to pursue consumer hardware here in the US. Make a product that is a hit, patent it, and still, one month later, IYTUOP will be selling an identical copy for 1/3rd the price on Amazon.
IP theft is one of the stated reasons for the trade war in the first place. It’s one of the major gripes the US has against China. There are limited means available to restrict a foreign country compared with an entity in the US. The DoJ did sue Huawei and win though.
Whether or not the countermeasures have been effective in practice is a minor detail in the GP point that we would not expect an American company headquartered in the US and conducting significant business in the US to get away with the same thing.
You never noticed the hypocritical behavior all over society?
* Oh, you drunk drive? Big fine, lots of trouble.
* Oh, you drunk drive and are a senator, cop, mayor, ...? Well, let's look the other way.
* You have anger management issues and slam somebody to the ground? Jail time.
* You as a cop have anger management issues and slam somebody to the ground? Well, paid time off while we investigate, and maybe a reprimand. Qualified immunity, boy!
* You commit tax fraud for 10k? Felony record, maybe jail time.
* You as an exec of a company commit tax fraud for 100 million? After 10 years of lawyering around, maybe you get something, maybe... oh, here is a fine of 5 million.
I am sorry, but the idea of everybody being equal under the law has always been an illusion.
We are holding China accountable for counterfeiting products because it hurts OUR companies and their income. But when it's "us vs us", well, then it becomes a bit more messy, and in general, those with the biggest backing (as in $$$, economic value, and lawyers) tend to win.
Wait, if somebody steals my book, I can sue that person in court and get a payout (lawyers will cost me more, but that is not the point). If some AI company steals my book, well, the chance you win is close to 1%, simply because lots of well-paid lawyers will make winning hard to impossible.
Our society has always been based upon power, wealth, and influence. The more of it you have, the more you get away with (or get reduced penalties for) things that get others fined or jailed.
Why is it unethical of them to use the information in all these books? They are clearly not reselling the books in any way, shape, or form. The information itself in a book can never be copyrighted. You can also publish and sell material where you quote other books within it.
If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them. I don't owe copyright to every book I read because I may subconsciously incorporate their ideas into my future work.
Something missed in arguments such as these is that measuring fair use includes a consideration of the impact on the potential market for a rightsholder's present and future works. In other words, can it be proven that what you are doing meaningfully deprives the author of future income?
Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.
On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.
Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.
> a consideration of impact on the potential market for a rightsholder's present and future works
This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.
As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?
We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press, but they're just nonsensical nowadays.
The core problem here is that copyright already doesn't actually follow any consistent logical reasoning. "Information wants to be free" and so on. So our own evaluation of whether anything is fair use or copyrighted or infringement thereof is always going to be exclusively dictated by whatever a judge's personal take on the pile of logical contradictions is. Remember, nominally, the sole purpose of copyright is not rooted in any notions of fairness or profitability or anything. It's specifically to incentivize innovation.
So what is the right interpretation of the law with regards to how AI is using it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works to then stochastically regurgitate tiny (or not so tiny) slices of them at a time will hurt innovation elsewhere?
But obviously the real answer here is money. Copyright is powerful because monied interests want it to be. Now that copyright stands in the way of monied interests for perhaps the first time, we will see how dedicated we actually were to whatever justifications we've been seeing for DRM and copyright for the last several decades.
Everything is different at scale. I'm not giving a specific opinion on copyright here, but it just doesn't make sense when we try to apply individual rights and rules to systems of massive scale.
I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.
The difference here is that an LLM is a mechanical process. It may not be deterministic (at least, in a way that my brain understands determinism), but it's still a machine.
What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.
That's not a philosophical argument at odds with our current understanding of copyright law. That's exactly what this judge found copyright law currently is and it's quoted in the article being discussed.
How is it stolen from Business Insider? When I visit businessinsider.com/anthropic-cut-pirated-millions-used-books-train-claude-copyright-2025-6 I get the same story. My browser caches the story, and I save it for archival purposes. How is this theft?
BI decides who can access this content and who will get the paywall. The link to archive page allows people to access this content without permission. That’s called stealing.
Buying, scanning, and discarding was in my proposal to train under copyright restrictions.
You are often allowed to make a digital copy of a physical work you bought. There are tons of used physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improving book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.
That's true and was the distinction I was making. In my proposal, and maybe part of what Anthropic did, the digitized copies are used as training data for a new work, the model. That reduces the risk of legal rulings against using the copyrighted works.
From there, the cases would likely focus on whether that fits in established criteria for digitized copies, whether they're allowed in the training process itself, and the copyright status of the resulting model. Some countries allow all of that if you legally obtained the material in the first place. Also, they might factor whether it's for commercial use or not.
I'm not seeing how this is fair use in either case.
Someone correct me if I am wrong, but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?
It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model clearly goes against what copyright stands for.
Copyright is largely about distributing copies. It’s not about making something vaguely similar or about referencing copyrighted work to make something vaguely similar.
Although, there’s an exception for fictional characters:
Copyright is not on "information"; it's on the tangible expression (i.e., the actual words). "Transformative use" is a defense in copyright infringement.
Digitizing the books is the equivalent of a blind person doing something to the book to make it readable to them... the software can't read analog pages.
Learning from the book is, well, learning from the book. Yes, they intended to make money off of that learning... but then I guess a medical student reading medical textbooks intends to profit off of what they learn from them. Guess that's not fair use either (well, it's really just use, as in the intended use for all books since they were first invented).
Once a person comes to believe that copyright has any moral weight at all, I guess all rational thought becomes impossible for them. Somehow, they're not capable of entertaining the idea that copyright policy was only ever supposed to be a pragmatic thing to incentivize creative works... and that whatever little value it has disappears entirely once the policy is twisted to consolidate control.
Copyright isn't a digital moat. It's largely an agreement that the work is available to the public, but the creator has a limited amount of time to exploit it at market.
If you sell an AI model, or access to an AI model, there's usually around 0% of the training data redistributed with the model. You can't decompile it and find the book. As you aren't redistributing the original work, copyright is barely relevant.
Imagine suggesting that because you own the design of a hammer, all works created with the hammer belong to you and can't be sold?
That someone came up with a new method of using books as a tool to create a different work, does not entitle the original book author to a cut of the pie.
Available to the public is one thing, but a for-profit company is not "the public". They are providing a service that makes that work, regardless of whatever form it is in, available to the public. This seems like a middleman situation that makes a profit off of access to information, regardless of what form it is in.
> to make a profit off of the information that is included in these works?
Isn't that what a lot of companies are doing, just through employees? I read a lot of books, and took a lot of courses, and now a company is profiting off that information.
They clearly were being digitized, but I think it's a more philosophical discussion, one we're banging our heads against for the first time, to say whether or not it is fair use.
Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings. Digitization is just memory. If the models cannot think then it is meaningless digital regurgitation and plagiarism, not to mention breach of copyright.
The quotes "consistent with copyright's purpose in enabling creativity and fostering scientific progress." and "Like any reader aspiring to be a writer" say, from what I can tell, that the judge has legally ruled the model can think as a human does, and therefore has the legal protections afforded to "creatives."
In my mind, there is a difference between a person using their own creative thinking to create a derivative work from learning about a subject and making money off of it, versus a corporation with a language model that is designed to absorb the works of the entire planet and redistribute that information in a way that puts them in a centralized position to become an authority on information. With a person, there is a certain responsibility to create meaning from that work so that others can experience it. For-profit companies are like machines that have no interest in the creative-expression part of this process, hence the concern that they do not have the best interests of the public at heart.
> Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings.
No, that's fallacious. Using anthropomorphic words to describe a machine does not give it the same kinds of rights and affordances we give real people.
What do you think fair use is? The whole point of the fair use clauses is that if you transform copyrighted works enough you don't have to pay the original copyright holder.
Fair use is not, at its core, about transformation. It's about many types of uses that do not interfere with the reasons for the rights we ascribe to authors. Fair use doesn't require transformation.
There is another case where companies slurped up all of the internet and profited off the information, that makes a good comparison - search engines.
Judges consider a four-factor test when examining fair use[1]. For search engines:
1) The use is transformative, as a tool to find content is very different purpose than the content itself.
2) The nature of the original work runs the full gamut, so search engines don't get points for only consuming factual data; but it was all publicly viewable by anyone, as opposed to books, which require payment.
3) Search engines store significant portions of the work in the index, but only redistribute small portions.
4) Search engines, as originally devised, don't compete with the original; in fact, they can improve the potential market of the original by helping more people find it. This has changed over time, though, and search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.
So traditional search which was transformative, only republished small amounts of the originals, and didn't compete with the originals fell firmly on the side of fair use.
Google News and Books on the other hand weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.
So now let's look at LLMs:
1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from the original works.
2) Again, this runs the full gamut (setting aside the clearly infringing downloading of illegally distributed books, which is a separate issue).
3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index; there is precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely to reproduce the originals in their outputs than search engines, there are failure cases where an LLM over-trained on a work and a significant amount of the original can be reproduced.
4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used - e.g., ignore the LLM altogether and consider solely the output and whether it would be infringing if a human created it.
This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, it weighs strongly as being fair use.
I think the real problems that LLMs face will be in factors 3 and 4, which is very much context specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.
Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use, but the legal industry disagrees.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to legally interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
Most likely option is C, as we've seen this pattern many times before.
There's a lot of conflation of "should/shouldn't" and "is/isn't". The comments by tech folk you're alluding to mostly think that it "shouldn't" be fair use, out of concern about the societal consequences, whereas judges are looking at it and saying that it "is" fair use, based on the existing law.
Any reasonable reading of the current state of fair use doctrine makes it obvious that the distance between Harry Potter and the Sorcerer's Stone and "a computer program that outputs responses to user prompts about a variety of topics" is wildly transformative, and thus the usage of the copyrighted material is probably covered by fair use.
> Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use
Where are you getting your data from? My conclusions are the exact opposite.
(Also, aren't judges by definition the only ones qualified to declare if it is actually fair use? You could make a case that it shouldn't be fair use, but that's different from it being not fair use.)
Armchair commentators, including myself, tend to be imprecise when speaking about whether something is illegal, versus something should be illegal. Sometimes due to a misunderstanding of the law, or an over-estimation of the court's authority, or an over-estimation of our legislature's productivity, or just because we're making conversation and like talking.
I don't understand at all the resistance to training LLMs on any and all materials available. Then again, I've always viewed piracy as compatible with markets and a democratizing force upon them. I thought (wrongly?) that this was the widespread progressive/leftist perspective, to err on the side of access to information.
It's not likely you've actually gotten the opinion of the "majority of tech folks", just the most outspoken ones, and only in specific bubbles you belong to.
I know for sure (b) is true. Way too many people on technical forums read legal texts as if the process to interpret laws is akin to a compiler generating a binary.
> Alsup detailed Anthropic's training process with books: The OpenAI rival spent "many millions of dollars" buying used print books, which the company or its vendors then stripped of their bindings, cut the pages, and scanned into digital files.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
I don’t think that’s gonna happen. I think they will manage to get themselves out of trouble for it, while the rest of us will still face serious problems if we are caught torrenting even one singular little book.
It's already quite widespread and likely legal for average people to train AI models on copyrighted material, in the open weight AI communities like SD and LocalLLaMa.
Please, please differentiate between pirating books (which Anthropic is liable for, and which is still illegal) and training on copyrighted material (which was found to be legal, for both corporations and average people).
Even so, would be hard to prove that this particular little book wasn't generated by Claude (oopsie, it happens to be a verbatim copy of a copyrighted work, that happens sometimes, those pesky LLMs).
It would be great, but I think some are worried that new AI BigTech will find a way to continue enforcing IP on the rest of society while it won't exist for them
What are your feelings about how the small fish are stripped of their art, and their years of work become just a prompt? Mainly comic artists and small musicians who are doing things they like and putting out for people, but not for much money?
>Mainly comic artists and small musicians who are doing things they like and putting out for people, but not for much money?
The number of these artists I have seen receiving some bogus DMCA takedown notice for fan art is crazy.
I saw a bloke give away some of his STLs because he received a takedown request from Games Workshop and didn't have the funds to fight it.
It's not that I want small artists to lose; it's that I want them to gain access to every bloody copyright and trademark so they are more free to create.
Shit, Conde Nast managed to pull something like 400 pulps off the market so they didn't interfere with their newly launched James Patterson collaborations.
It's true that intellectual property is a flawed and harmful mechanism for supporting creative work, and it needs to change, but I don't think ensuring a positive outcome is as simple as this. Whether or not such a power struggle between corporate interests benefits the public rather than just some companies will be largely accidental.
I do support intellectual property reform that would be considered radical by some, as I imagine you do. But my highest hopes for this situation are more modest: if AI companies are told that their data must be in the public domain to train against, we will finally have a powerful faction among capitalists with a strong incentive to push back against the copyright monopolists when it comes to the continuous renewal of copyright terms.
If the "path of least resistance" for companies like Google, Microsoft, and Meta becomes enlarging the public domain, we might finally begin to address the stagnation of the public domain, and that could be a good thing.
But I think even such a modest hope as that one is unlikely to be realized. :-\
Copyleft nullifies copyright. Abolishing copyright and adding right to repair laws (mandatory source files) would give the same effect as everyone using copylefted licenses.
By the way, I wonder if recent advancements in protecting YouTube videos from downloaders like yt-d*p are caused by unwillingness to help rival AI companies gather datasets.
I'm hoping they fail, to incentivize using legal, open, and/or licensed data. Then, they might have to attempt to train a Claude-class model on legal data. Then, I'll have a great, legal model to use. :)
I was 100% thinking this. GREAT book. And they, too, shredded books to ingest them into the digital library! I don't recall if it was an attempt to bypass copyright though; in Rainbows End, it was more technical: it was easier to shred, scan the pieces, and reassemble them in software than to scan each page.
"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Yeah, I'm not sure if people realize that the whole reason they had to cut up the books was because they wanted to comply with copyright law. Artificial scarcity.
You're not wrong, but that's one heck of a way to do it. It involves the destruction of 7 million books, which ... I really don't quite see the "promotion of Progress of Science and useful Arts" in that.
If ingesting books into an AI makes Anthropic criminals, then Google et al are also criminals alike for making search indexes of the Internet. Anything published online is equally copyrighted.
Exactly! If Anthropic is guilty of copyright infringement for the mere act of downloading copyrighted books, then so are Google, Microsoft (Bing), DuckDuckGo, etc. Every search engine that exists downloads pirated material every day. They'd all be guilty.
Not only that but all of us are guilty too because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.
Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.
Intent can guide a judge when they determine damages but that's about it.
Yeah, we can all agree that ingesting books is fair use and transformative, but you gotta own what you ingest, you can't just pirate it.
I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.
I disagree that it has anything to do with copyright. It is at most theft. If I steal a bunch of books from the library, I haven't committed any breach of copyright.
Something I've been trying to reconcile: I buy a cheap used book on biblio and I'm morally OK even though the writer doesn't get paid. But if I pirate the book, then I'm wrong for that because the writer doesn't get paid?
Based on the fact that people went to jail for downloading some music or movies, this guy will face a lifetime in prison for 7 million books that he then used for commercial profit, right?
Right, guys? Surely we don't have "rules for thee but not for me" in the land of the free?
Meta did the same, and probably other big companies too. People who praise AGI are very short-sighted. It will ruin the world with our current morals and ethics. It's like a nuclear weapon in the hands of barbarians (shit, we have that too, actually).
The main problem I have with that argument is: when have we ever been "ready" for any technology ahead of time? Its impact is always unknown, not only on society but on the technology itself. What sort of 'preparatory' work can a society do to satisfy this? If we were to apply such 'precautionary' logic to automobiles, it would at very best start with a large public works project to create separated grades for all vehicles before building a single vehicle (with no idea how tight the constraints would truly be, thinking vehicles would go only as fast as a horse at a gallop). It is far more likely to wind up a silly waste of time or a pretext for Luddism trying to delay the inevitable. Like it or not, being capable of doing something is the only true deciding standard of being ready for it.
Not to mention the whole notion of being able to judge others as 'not ready for it' is an insult to the very notion of individual self-determination. Imagine, for instance, if the western world in the past had taken after Starfleet in the worst of ways and banned supplying medical aid to sub-Saharan Africa because it judged their society as not ready for it. They would rightfully be called callous, arrogant, racist imperialists for thinking it their right to impose suffering upon others and deny them opportunities and self-determination because of their own parochial judgement, thinking they knew better! Putting oneself in the position to judge for the world is an act of hubris far greater than what they project upon the attempted inventors of AGI.
So using the standard industry metrics for calculating the financial impact of piracy, this would equate to something like trillions of damages to the book publishing industry?
If AI companies are allowed to use pirated material to create their products, does it mean that everyone can use pirated software to create products? Where is the line?
Also please don't use word "learning", use "creating software using copyrighted materials".
Also let's think together how can we prevent AI companies from using our work using technical measures if the law doesn't work?
It's abusive and wrong to try and prevent AI companies from using your works at all.
The whole point of copyright is to ensure you're paid for your work. AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
If that LLM reproduces your work, then the AI company is violating copyright, but if the LLM doesn't reproduce your work, then you have not been harmed. Trying to claim harm when you haven't been due to some philosophical difference in opinion with the AI company is an abuse of the courts.
> It's abusive and wrong to try and prevent AI companies from using your works at all.
People don't view moral issues in the abstract.
A better perspective on this is the fact that human individuals have created works which megacorps are training on for free or for the price of a single book and creating models which replace individuals.
The megacorps are only partially replacing individuals now, but when the models get good enough they could replace humans entirely.
When such a future happens will you still be siding with them or with individual creators?
> The whole point of copyright is to ensure you're paid for your work.
No. The point of copyright is that the author gets to decide under what terms their works are copied. That's the essence of copyright. In many cases, authors will happily sell you a copy of their work, but they're under no obligation to do so. They can claim a copyright and then never release their work to the general public. That's perfectly within their rights, and they can sue to stop anybody from distributing copies.
Current copyright law is not remotely sophisticated enough to make determinations on AI fair use. Whether the courts say current AI use is fair is irrelevant to the discussion most people on this side would agree with: that we need new laws. The work the AI companies stole to train on was created under a copyright regime where the expectation was that, eh, a few people would learn from and be inspired by your work, and that feels great because you're empowering other humans. Scale does not amplify Good. The regime has changed. The expectations about what kinds of use copyright protects against have fundamentally changed. The AI companies invented New Horrors that no one could have predicted, Vader altered the deal, no reasonable artist except the most forward-thinking sci-fi authors would have remotely guessed what their work would be used for, and thus could never have consciously and fairly agreed to this exchange. Very few would have agreed to it.
It is not wrong at all. The author decides what to do with their work. AI companies are rich and can simply buy the rights or hire people to create works.
I could agree with exceptions for non-commercial activity like scientific research, but AI companies are made for extracting profits and not for doing research.
> AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
It doesn't work this way. If you buy a movie it doesn't mean you can sell goods with movie characters.
> then you have not been harmed.
I am harmed because fewer people will buy the book if they can simply get an answer from an LLM. Fewer people will hire me to write code if an LLM trained on my code can do it. Maybe instead of books we should start making applications that protect the content and do not allow copying text or making screenshots. And instead of open-source code we should provide binary WASM modules.
You are allowed to buy and scan books, and then use those scanned books to create products. I guess you are also allowed to pirate books and use the knowledge to create products, if you are willing to pay the damages to the rights holders for copyright violations.
Let’s say my AI company is training an AI on woodworking books and at the end, it will describe in text and wireframe drawings (but not the original or identical photos) how to do a particular task.
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
Copyright doesn’t cover facts and methods. It specifically covers creative expressions. That’s why patents are different from copyright. If you read some woodworking books and then write your own online tutorial about building a chair using the methods and procedures described in that book, it doesn’t matter that you now compete with the books that you used, provided you didn’t copy the creative elements. How much of a chair design is creative and how much is function is an ambiguous question that might still land you in court, but it won’t be over your right to make the tutorial in the first place.
As the judge noted in this ruling, copyright isn’t intended to protect authors from competition. Copyright doesn’t protect Rowling from other authors writing YA wizard books cutting into her revenue streams. Or from TV producers making YA wizard shows that reduce the demand for books. Copyright doesn’t protect the Tolkien estate from Terry Brooks, or Tracy Hickman or Margret Weiss reducing the demand for Tolkien fantasy by supplanting it with their own fantasies.
The farce of treating a corporation as an individual precludes common-sense legal procedure to investigate the people responsible for criminal action taken by the company. It's obviously premeditated and in all ways an illicit act knowingly perpetrated by persons. The only discourse should be about upending this penthouse legalism.
The “farce” of treating a corporation as a legal individual is the reason you can have this case in the first place. Otherwise the authors would have had to discover and individually sue each specific individual in the company for each specific claim. They would have to find the specific individual that downloaded their specific book and sue that person. Then they would need to find the specific individual that digitized their specific book and sue that person. Then they would need to find the specific person that loaded that digital copy into an AI model and sue that person. And on and on for each alleged act of infringement.
Or we could recognize that’s silly when we’re talking about a group of people acting in concert and treat them as a single entity for the purpose of alleged crimes. Which is what we do when we treat a corporation as an individual for legal purposes.
The irony is that actually litigating copyright law would lead to the repeal of said copyright law. And so in all cases of backwaters laws that are used to "protect interests" of "corporations" yet criminalize petty individual cases.
This of course cannot be allowed to happen, so the legal system is just a limbo: a bar which regular individuals must strain to pass under but that corporations regularly overstep.
I've begun to wonder if this is why some large torrent sites haven't been taken down. They are essentially able to crowdsource all the work. There are some users who spend ungodly amounts of time and money on these sites that I suspect are rich industry benefactors.
If Anthropic is funded by Amazon, they should have just asked Amazon for unlimited download of EVERY book in the Amazon book store, and all audio-books as well. It certainly would be faster than buying one copy of each and tearing it apart.
Hang on, it is OK under copyright law to scan a book I bought second hand, destroy the hard copy and keep the scan in my online library? That doesn't seem to chime with the copyright notices I have read in books.
First sale doctrine gives the person who sold the book you bought the right to sell it to you. Fair Use permits you to scan your copy, used or new. It's your book, you can destroy it. But you have to delete your digital copy if you sell it or give it away. And you can't distribute your digital copy.
> That doesn't seem to chime with the copyright notices I have read in books.
I used to get scared by such verbiage. Courts ruled decades ago that many of those uses are actually permitted, under very common conditions (e.g. not distributing, etc). Yes, you totally can photocopy a book you own, for your own purposes.
The article doesn't say who is suing them. Is it a class action? How many of these 7M pirated books have they written? Is it publishing houses? How many of these books are relevant in this judgement?
As far as I understand, training on books is clearly not fair use (as the result will likely hurt the livelihood of authors, especially authors who aren't "best of the best").
As long as you buy the book, it still should be legal; that is, if you actually buy the book and not a "read only" eBook.
But the 7_000_000 pirated books are a huge issue, and one we have a lot of reason to believe isn't specific to Anthropic.
Buying a copy of a book does not give you license to take the exact content of that book, repackage it as a web service, and sell it to millions of others. That's called theft.
> "Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different," he wrote.
But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only humans can. And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material. Also, people do not read millions of books to become a writer.
> But this analogy seems wrong. First, LLM is not a human and cannot "learn" or "train" - only human can do it.
The analogy refers to humans using machines to do what would already be legal if they did it manually.
> And LLM developers are not aspiring to become writers and do not learn anything, they just want to profit by making software using copyrighted material.
[Citation needed], and not a legal argument.
> Also people do not read millions of books to become a writer.
> But people do hear millions of words as children.
At a rate of 1000 words/day it takes about 3 years to hear a million words. Also, a "million words" is not equal to a "million books". Humans are ridiculously efficient at learning compared to LLMs.
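A quick sanity check of that estimate, using the figures from the comment above (the 1,000 words/day rate is the comment's assumption, not a measured number):

```python
# Back-of-the-envelope: how long does it take to hear a million words
# at an assumed rate of 1,000 words per day?
WORDS_PER_DAY = 1_000   # assumed rate from the comment
TARGET_WORDS = 1_000_000

days = TARGET_WORDS / WORDS_PER_DAY   # 1,000 days
years = days / 365                    # about 2.7, i.e. roughly 3 years
print(f"{days:.0f} days ~= {years:.1f} years")
```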
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum statutory fine for each instance is $750, then the law says that Anthropic is liable for at least $5.25 billion. I just want it to be out there that they definitely broke the law and the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens we at least can't pretend we didn't know.
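The arithmetic behind that figure, as a sketch. The $750 minimum and $150,000 willful maximum per work are the statutory-damages bounds in 17 U.S.C. § 504(c); treating each of the 7 million books as a separately infringed work is an assumption, not something the ruling established:

```python
# Statutory-damages range if each pirated book counted as one infringed work.
WORKS = 7_000_000
MIN_PER_WORK = 750              # statutory minimum per work
MAX_WILLFUL_PER_WORK = 150_000  # statutory maximum for willful infringement

minimum = WORKS * MIN_PER_WORK              # $5,250,000,000
willful_max = WORKS * MAX_WILLFUL_PER_WORK  # $1,050,000,000,000
print(f"minimum: ${minimum:,}")
print(f"willful maximum: ${willful_max:,}")
```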
The solution has always been: show us the training data.
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to start making claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt: "Steal everything, then if you get big, let the lawyers figure it out." This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the richest companies, if not the richest, ever to exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion-dollar company, 2020 when it became the first two-trillion-dollar company, and 2022 when it became the first three-trillion-dollar company. Now we have 10 companies north of the trillion-dollar mark![3] (5 above $2T and 3 above $3T) These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better. To at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement) and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazine flew down the tunnel like leaves in tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
Everybody that wants to train an LLM, should buy every single book, every single issue of a magazine or a newspaper, and personally ask every person that ever left a comment on social media. /s
If I was China I would buy every lawyer to drown western AI companies in lawsuits, because it's an easy way to win AI race.
It's even worse considering all he downloaded was in public domain so it was much less problematic considering copyright.
The lesson is simple: if you want to break a law, make sure it is very profitable, because then you can find investors and get away with it. If you play Robin Hood you will be met with a hammer.
Amazon has been doing this since the 2000s. Fun fact: this is how AWS came about; they built it to scale the "LOOK INSIDE!" feature for all the books Amazon was hoovering up in an attempt to kill the last benefit the bookstore had over them.
I.e., this is not a big deal. The only difference now is that people are rapidly frothing to be outraged at the mere sniff of new tech on the horizon. Overton window in effect.
https://archive.md/YLyPg
The important parts:
> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
You skipped quotes about the other important side:
> But Alsup drew a firm line when it came to piracy.
> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."
That is, he ruled that
- buying, physically cutting up, physically digitizing books, and using them for training is fair use
- pirating the books for their digital library is not fair use.
> buying, physically cutting up, physically digitizing books, and using them for training is fair use
So Suno would only really need to buy the physical albums and rip them to be able to generate music at an industrial scale?
59 replies →
As they mentioned, the piracy part is obvious. It's the fair use part that will set an important precedent for being able to train on copyrighted works as long as you have legally acquired a copy.
3 replies →
So all they have to do is go and buy a copy of each book they pirated. They will have ceased and desisted.
71 replies →
> That is, he ruled that
> - buying, physically cutting up, physically digitizing books, and using them for training is fair use
> - pirating the books for their digital library is not fair use.
That seems inconsistent with one another. If it's fair use, how is it piracy?
It also seems pragmatically trash. It doesn't do the authors any good for the AI company to buy one copy of their book (and a used one at that), but it does make it much harder for smaller companies to compete with megacorps for AI stuff, so it's basically the stupidest of the plausible outcomes.
10 replies →
> You skipped quotes about the other important side:
He said:
> It was always somewhat obvious that pirating a library would be copyright infringement.
??
From my understanding:
> pirating the books for their digital library is not fair use.
"Pirating" is a fuzzy word and has no real meaning. Specifically, I think this is the crux:
> without adding new copies, creating new works, or redistributing existing copies
Essentially: downloading is fine, sharing/uploading is not. Which makes sense. The assertion here is that Anthropic (from this line) did not distribute the files they downloaded.
13 replies →
I don't think that's new. Google set a precedent for that more than a decade ago. You're allowed to transform a book to digital.
How times change. They wanted to lock up Aaron Swartz for life for essentially doing the same thing Anthropic is doing.
Aaron Swartz wanted to provide the public with open access to paywalled journal articles, while Anthropic wants to use other people's copyrighted material to train its own private models that it restricts access to via a paywall. It's wild (but unsurprising) that Aaron Swartz was prosecuted under the CFAA for this while Anthropic is allowed to become commercially successful.
AFAIK, Judge Vince Chhabria has countered that Fair Use argument in a later order involving Meta.
https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...
Note: I am not a lawyer.
I'm not sure how I feel about what Anthropic did on the merits as a matter of scale, but from a legalistic standpoint, how is it different from using the book to train the meat model in my head? I could even learn bits by heart and quote them in context.
Not sure about the law, but if you memorize and quote bits of a book and fail to attribute them, you could be accused of plagiarism. If for example you were a journalist or researcher, this could have professional consequences. Anthropic is building tools to do the same at immense scale with no concept of what plagiarism or attribution even is, let alone any method to track sourcing--and they're still willing to sell these tools. So even if your meat model and the trained model do something similar, you have a notably different understanding of what you're doing. Responsibility might ultimately fall to the end user, but it seems like something is getting laundered here.
Machines do not have the rights that belong to humans.
Feels like information laundering to me.
Is fruit of the poisonous tree rule applicable here?
That's only really applicable to evidence in criminal cases obtained by the government. No such doctrine exists for civil cases, for instance. It doesn't even bar the government from using evidence that others have collected illegally of their own volition.
Here is how individuals are treated for massive copyright infringement:
https://investors.autodesk.com/news-releases/news-release-de...
I thought you'd go with this: https://en.wikipedia.org/wiki/United_States_v._Swartz
Swartz wasn't charged with copyright infringement.
8 replies →
> Here is how individuals are treated for massive copyright infringement:
When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.
This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.
> illegally copying and selling pirated software
This is very different to what Anthropic did. Nobody was buying copies of books from Anthropic instead of the copyright holder.
I wouldn't be so sure about that statement, no one has ruled on the output of Anthropic's AI yet. If their AI spits out the original copy of the book then it is practically the same as buying a book from them instead of the copyright holder.
We've only dealt with the fairly straight-forward legal questions so far. This legal battle is still far from being settled.
5 replies →
At the very least, they should have purchased the originals once
8 replies →
Anthropic isn’t selling copies of the material to its users though. I would think you couldn’t lock someone up for reading a book and summarizing or reciting portions of the contents.
Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.
> summarizing or reciting portions of the contents
This absolutely falls under copyright law as I understand it (not a lawyer). E.g. the disclaimer that rolls before every NFL broadcast. The notice states that the broadcast is copyrighted and any unauthorized use, including pictures, descriptions, or accounts of the game, is prohibited. There is wiggle room for fair use by news organizations, critics, artists, etc.
3 replies →
I'm wondering though how the law will construe AI able to make a believable sequel to Moby Dick after digesting Herman Melville's works. (Or replace Melville with a modern writer.)
1 reply →
Except they aren’t merely reading and reciting content, are they? That’s a rather disingenuous argument to make. All these AI companies are high on billions in investment and think they can run roughshod over all rules in the sprint towards monetizing their services.
Make no mistake, they’re seeking to exploit the contents of that material for profits that are orders of magnitude larger than what any shady pirated-material reseller would make. The world looks the other way because these companies are “visionary” and “transformational.”
Maybe they are, and maybe they should even have a right to these buried works, but what gives them the right to rip up the rule book and (in all likelihood) suffer no repercussions in an act tantamount to grand theft?
There’s certainly an argument to be had about whether this form of research and training is a moral good and beneficial to society. My first impression is that the companies are too opaque in how they use and retain these files, albeit for some legitimate reasons, but nevertheless the archival achievements are hidden from the public, so all that’s left is profit for the company on the backs of all these other authors.
before breaking the law, set up a corporation to absorb the liability!
in other words, provided you have enough spare capital to spin up a corporation, you can break the law!!!!
What point are you making? 20 years ago, someone sold pirated copies of software (where's the transformation there?) and that's the same as using books in a training set? The judge already said reading isn't infringement.
This is reaching at best.
Aren't you comparing the wrong things? The first example is about the output/outcome; what is the equivalent for LLMs? Also, not all "pirated" things are sold; most are in fact distributed for free.
"Pirates" also transform the works they distribute. They crack it, translate it, compress it to decrease download times, remove unnecessary things, make it easier to download by splitting it in chunks (essential with dial-up, less so nowadays), change distribution formats, offer it trough different channels, bundle extra software and media that they themselves might have coded like trainers, installers, sick chiptunes and so on. Why is the "transformation" done by a big corpo more legal in your views?
2 replies →
Peterson was copying and selling pirated software.
Come up with a better comparison.
Anthropic is selling a service that incorporates these pirated works.
19 replies →
copyright is not the same as piracy
Piracy isn't a thing, except on the high seas. What you're thinking of is copyright violation.
6 replies →
Can you explain why? What makes them categorically different or at the very least why is "piracy" quantitatively worse than 'just' copyright violation?
15 replies →
Apparently it's a common business practice. Spotify (even though I can't find any proof) seems to have built its software and business on pirated music. There is some more in this article [0].
https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
Crunchyroll was originally an anime piracy site that went legit and started actually licensing content later. They started in mid-2006, got VC funding in 2008, then made their first licensing deal in 2009.
https://www.forbes.com/2009/08/04/online-anime-video-technol...
https://venturebeat.com/business/crunchyroll-for-pirated-ani...
Yep, they were huge too - virtually anyone who watched free anime back then would have known about them.
My theory is that once they saw how much traffic they were getting, they realized how big of a market (subbed/dubbed) anime was.
Good Old Games started out with the founders selling pirated games on disc at local markets.
1 reply →
And now Crunchyroll is owned by (through a lot of companies, like Aniplex of America, Aniplex, A1 Pictures) Sony, who produces a large amount of anime!
Not just Spotify; pretty much any (most?) current tech giant was built by
- riding a wave of change
- not caring too much about legal constraints (or, as they would say now, "disrupting" the market, which very often means doing illegal stuff that brings them far more money than any penalties they will ever face for it)
- not caring too much about ethics either
- and in recent years (starting with Amazon) a lot of technically illegal financing (undercutting competitors' prices long-term using money from elsewhere (e.g. investors) is an unfair competitive advantage, theoretically clearly not allowed by antitrust laws; and before that you often still had other monopoly issues (e.g. see Wintel))
So yes, systematically not complying with the law to gain an unfair competitive advantage, knowing that many of these laws are in the bigger picture toothless when applied to huge companies, is bread-and-butter work for US tech giants.
As you point out, they mostly did this before they were large companies (where the public choice questions are less problematic). Seems like the breaking of these laws was good for everybody.
5 replies →
"recording obtained unofficially" and "doesn't have rights to the recording" are separate things. So they could well have got a license to stream a publisher's music but that didn't come with an actual copy of some/all of the music.
It wasn’t just the content being pirated, but the early Spotify UI was actually a 1:1 copy of Limewire.
There's plenty of startups gone legitimate.
Society underestimates the chasm that exists between an idea and raising sufficient capital to act on those ideas.
Plenty of people have ideas.
We only really see those that successfully cross it.
Small things EULA breaches, consumer licenses being used commercially for example.
The problem is that these "small things" are not necessarily small if you're an individual.
If you're an individual pirating software or media, then from the rights owners' perspective, the most rational thing to do is to make an example of you. It doesn't happen everyday, but it does happen and it can destroy lives.
If you're a corporation doing the same, the calculation is different. If you're small but growing, future revenues are worth more than the money that can be extracted out of you right now, so you might get a legal nastygram with an offer of a reasonable payment to bring you into compliance. And if you're already big enough to be scary, litigation might be just too expensive to the other side even if you answer the letter with "lol, get lost".
Even in the worst case - if Anthropic loses and the company is fined or even shuttered (unlikely) - the people who participated in it are not going to be personally liable and they've in all likelihood already profited immensely.
1 reply →
But it's not some small things:
it's systematic, widespread, big things, and often many of them, giving US giants an unfair competitive advantage.
And don't think that if you are an EU company you can do the same in the US. Nope.
But naturally the US insists that US companies can do that in the EU, and complains every time a US company is fined for not complying with EU law.
>Society underestimates the chasm that exists between an idea and raising sufficient capital to act on those ideas.
The AI sector, famously known for its inability to raise funding. Anthropic has in the last four years raised 17 billion dollars
1 reply →
Uber
There's no credible evidence Spotify built their company and business on pirated music.
This is a narrative that gets passed around in certain circles to justify stealing content.
21 replies →
Apparently it's a common business practice.
It's not a common business practice. That's why it's considered newsworthy.
People on the internet have forgotten that the news doesn't report everyday, normal, common things, or it would be nothing but a listing of people mowing their lawns or applying for business loans. The reason something is in the news is because it is unusual or remarkable.
"I saw it online, so it must happen all the time" is a dopy lack of logic that infects society.
You are right on that. I’ll edit my post to reflect that.
Edit: Apologies, I can’t edit it anymore.
Google Music originally let people upload their own digital music files. The argument at the time was that whether or not the files were legally obtained was not Google’s problem. I believe Amazon had a similar service.
https://www.computerworld.com/article/1447323/google-reporte...
This isn't as meaningful as it sounds. Nintendo was apparently using scene roms for one of the official emulators on Wii (I think?). Spotify might have received legally-obtained mp3s from the record companies that were originally pulled from Napster or whatever, because the people who work for record companies are lazy hypocrites.
The NES Classic console. The ROMs had an iNES header, lol.
And the PlayStation Classic used an open-source PS1 emulator.
There was also some Steam game ported from GameCube, and it had the Dolphin emulator FPS counter in the corner of part of the trailer :D
I also remember reading that two of the PCSX2 devs ended up working on the Emotion Engine emulator for PS3 consoles with partial software emulation of the PS2 (the CECH-02 and later models, where they removed the Emotion Engine chip).
The common meme is that megacorps are shamelessly criminalistic organizations that get away with doing anything they can to maximize profits, while true in some regard, totally pales in comparison to the illegal things small businesses and start-ups do.
YouTube's initial success came from being able to serve, on a global scale, user-uploaded, largely uncredited copyright violations of both video and audio.
Facebook's "pivot to video" similarly relied on user-uploaded unlicensed video content, now not just pulling from television and film, but from content creators on platforms like YouTube.
Today, every "social" platform is now littered with "no copyright infringement intended" and "all credit to the original" copy-and-paste junk. Don't get me wrong, I'm a fan of remix culture – but I believe appropriating and monetizing the work of others without sharing the reward is a destructive cycle. And while there are avenues for addressing this, they're designed for the likes of Universal, Sony, Disney, etc. (I've had original recordings of original music flagged by megacorps because the applause triggered ContentID.)
AI slop further poisons the well. It's rough going out there.
You are missing the point. Spotify had permission from the copyright holders and/or their national proxies to use those songs in a limited beta in Sweden. They didn't have access to clean audio data directly from the record companies, so in many cases they used pirated rips instead.
What you really should be asking is whether they infringed on the copyrights of the rippers. /s
They had a second company (whose name I don't remember) that allowed users to back up and share their music. When they were exposed, they buried that as deep as they could.
No. There's no credible evidence Spotify had any secret second company that allowed users to back up and share music without authorisation
It was the opposite. Their mission was to combat music piracy by offering a better, legal alternative.
Daniel Ek said: "my mission is to make music accessible and legal to everyone, while ensuring artists and rights holders got paid"
Also, the Swedish government has zero tolerance for piracy.
Mission is just words, they can mean the opposite of deeds, but they can't be the opposite, they live in different realms.
I know this might come as a shock to those living in San Francisco, but things are different in other parts of the world, like Uruguay, Sweden and the rest of Europe. From what I’ve read, the European committee actually cares about enforcing the law.
Anthropic's cofounder, Ben Mann, downloaded millions of books from Library Genesis in 2021, fully aware that the material was pirated.
Stealing is stealing. Let's stop with the double standards.
At least most pirates just consume for personal use. Profiting from piracy is a whole other level beyond just pirating a book.
Someone on Twitter said: "Oh well, P2P mp3 downloads, although illegal, made contributions to the music industry"
That's not what's happening here. People weren't downloading music illegally and reselling it on Claude.ai. And while P2P networks led to some great tech, there's no solid proof they actually improved the music industry.
11 replies →
I feel like profit was always a central motive of pirates. At least from the historical documents known as, "The Pirates of the Caribbean".
This isn't really profiting from piracy. They don't make money off the raw input data. It's no different to consuming for personal use.
They make money off the model weights, which is fair use (as confirmed by recent case law).
18 replies →
> At least most pirates just consume for personal use.
Easy for the pirate to say. Artists might argue their intent was to trade compensation for one's personal enjoyment of the work.
5 replies →
Stealing with the intent to gain an unfair market advantage, so that you can effectively kill any ethically and legally correct company, in a way that is very likely going to hurt many authors through the products you create, is far worse than just stealing for personal use.
That isn't "just" stealing; it's organized crime.
> Stealing is stealing. Let's stop with the double standards.
I get the sentiment, but that statement as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.
Let's get actual definitions of 'theft' before we leap into double standards.
Copyright infringement is not stealing.
It's very similar to theft of service.
There are so many possible texts, and they're so sparse, that if I could copyright a work and never publish it, the restriction would be irrelevant. The probability that you would accidentally come upon something close enough that copyright was relevant is almost infinitesimal.
Because of this copyright is an incredibly weak restriction, and that it is as weak as it is shows clearly that any use of a copyrighted work is due to the convenience that it is available.
That is, it's about making use of the work somebody else has done, not about that restricting you somehow.
Therefore copyright is much more legitimate than ordinary property. Ordinary property, especially ownership of land, can actually limit other people. But since copyright is so sparse infringing on it is like going to world with near-infinite space and picking the precise place where somebody has planted a field and deciding to harvest from that particular field.
Consequently I think copyright infringement might actually be worse than stealing.
8 replies →
actually, the Only time it's a (ethical) crime is when a corporation does it at scale for profit.
property infringement isn't either?
1 reply →
[flagged]
10 replies →
[flagged]
oh well, the product has a cute name and will make someone a billionaire, let's just give it the green light. who cares about copyright in the age of AI?
Information wants to be free.
Then why does Claude cost money?
1 reply →
Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?
Saying "they have the money" is not an argument. It's about the amount of effort that is needed to individually buy, scan, and process millions of pages. If that's already been done for you, why re-do it all?
The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I'm against Anthropic stealing teachers' work and discouraging them from ever writing again. Some teachers are already saying this (though probably not in California).
> The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I think this is a fantasy. My father cowrote a Springer book about physics. For the effort, he got like $400 and 6 author copies.
Now, you might say he got a bad deal (or the book was bad), but I don't think hundreds of thousands of authors do significantly better. The reality is, people overwhelmingly write because they want to, not because of money.
3 replies →
Stealing? In what way?
Training a generative model on a book is the mechanical equivalent of having a human read the book and learn from it. Is it stealing if a person reads the book and learns from it?
7 replies →
[flagged]
That will be sad, although there will still be plenty of great people who will write books anyway.
When it comes to a lot of these teachers, I'll say, copyright work hand in hand with college and school course book mandates. I've seen plenty of teachers making crazy money off students' backs due to these mandates.
A lot of the content taught in undergrad and school hasn't changed in decades or even centuries. I think we have all the books we'll ever need in certain subjects already, but copyright keeps enriching people who write new versions of these.
If you care so little about writing that AI puts you off it, TBH you're probably not a great writer anyhow.
Writers that have an authentic human voice and help people think about things in a new way will be fine for a while yet.
1 reply →
They won't be needed anymore, once singularity is reached. This might be their thought process. This also exemplifies that the loathed caste system found in India is indeed in place in western societies.
There is no equality, and seemingly there are worker bees who can be exploited, and there are privileged ones, and of course there are the queens.
5 replies →
150K per work is the maximum fine for willful infringement (which this is).
105B+ is more than Anthropic is worth on paper.
Of course they’re not going to be charged to the fullest extent of the law, they’re not a teenager running Napster in the early 2000s.
> 150K per work is the maximum fine for willful infringement
No, it's not.
It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Of course, there's also a very wide range of statutory damages; the minimum (if it is not "innocent" infringement) is $750/work.
> 105B+ is more than Anthropic is worth on paper.
The actual amount of 7 million works times $150,000/work is $1.05 trillion, not $105 billion.
1 reply →
Even if they don't qualify for willful-infringement damages (let's say they had a good-faith belief their infringement was covered by fair use), the standard statutory damages for copyright infringement are $750-$30,000 per work.
Plus they did it with a profit motive, which would entail criminal proceedings.
Isn't "pirating" a felony with jail time, though? That's what I remember from the FBI warning I had to see at the beginning of every DVD I bought (but not "pirated" ones).
Yes criminal copyright infringement (willful copyright infringement done for commercial gain or at a large scale) is a felony.
[flagged]
4 replies →
Just downloading them is of course cheaper, but it is worth pointing out that, as the article states, they did also buy legitimate copies of millions of books. (This includes all the books involved in the lawsuit.) Based on the judgement itself, Anthropic appears to train only on the books legitimately acquired. Used books are quite cheap, after all, and can be bought in bulk.
Buying a book is not a license to re-sell that content for your own profit. I can't buy a copy of your book, make a million Xeroxes of it, and sell those. The license you get when you buy a book is for a single use, not a license to do whatever you want with the contents of that book.
2 replies →
If you wanted to be legit with zero chance of going to court, you would contact the publisher, ask to pay for a license to access their catalog for training, and negotiate from that point.
This is what every company using media does (think Spotify, Netflix, but also journals, ad agencies, ...). I don't know why people on HN are giving AI companies a pass for this kind of behavior.
> I don't know why people on HN are giving AI companies a pass for this kind of behavior.
As mentioned in The Fucking Article, there's a legal difference between training an AI that largely doesn't repeat things verbatim (à la Anthropic) and redistributing media wholesale (à la Spotify, Netflix, journals, ad agencies).
[flagged]
2 replies →
Because they are mostly software developers who think it's different because it impacts them.
This is not about paying for a single copy. It would still be wrong even if they have bought every single one of those books. It is a form of plagiarism. The model will use someone else's idea without proper attribution.
Legally speaking, we don't know that yet. Early signs are pointing at judges allowing this kind of crap because it's almost impossible for most authors to point out what part of the generated slop was originally theirs.
At minimum they should have to buy the book they are deriving weights from.
But should the purchase be like a personal license? Or like a commercial license that costs way more?
Because, for example, if you buy a movie on disc, that's a personal license and you can watch it yourself at home. But you can't play it at a large public venue that sells tickets to watch it. You need a different and more expensive license to make money off the usage of the content in a larger capacity like that.
1 reply →
Google did it the legal way with Google Books, didn't they?
[flagged]
8 replies →
> Pirate and pay the fine is probably hell of a lot cheaper than individually buying all these books.
$500,000 per infringement...
And the crazy thing is that might be cheaper when you consider the alternative is to have your lawyers negotiate with the lawyers for the publishing companies for the right to use the works as training data. Not only is it many many billable hours just to draw up the contract, but you can be sure that many companies would either not play ball or set extremely high rates. Finally, if the publishing companies did bring a suit against Anthropic they might be asked to prove each case of infringement, basically to show that a specific work was used in training, which might be difficult since you can't reverse a model to get the inputs. When you're a billion dollar company it's much easier to get the courts to take your side. This isn't like the music companies suing teenagers who had a Kazaa account.
> I'm not saying this is justified, but what would you have done in their situation?
Individuals would have their lives ruined either from massive fines or jail time.
These are the people shaping the future of AI? What happened to all the ethical values they love to preach about?
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
> We've held China accountable for counterfeiting products for decades and regulated their exports
We have? Are we from different multi-verses?
The one I've lived in to date has not done anything against Chinese counterfeits beyond occasionally seizing counterfeit goods during import. But that's merely occasionally enforcing local counterfeit law, a far cry from punishing the entity producing it.
As a matter of fact, companies started outsourcing everything to China, making further IP theft and quasi-copies even easier.
I was gonna say, the enforcement is so weak that it's not even really worth it to pursue consumer hardware here in the US. Make a product that's a hit, patent it, and still one month later IYTUOP will be selling an identical copy for 1/3rd the price on Amazon.
1 reply →
IP theft is one of the stated reasons for the trade war in the first place. It’s one of the major gripes the US has against China. There are limited means available to restrict a foreign country compared with an entity in the US. The DoJ did sue Huawei and win though.
Whether or not the countermeasures have been effective in practice is a minor detail in the GP point that we would not expect an American company headquartered in the US and conducting significant business in the US to get away with the same thing.
One rule for you, one rule for me ...
You never noticed the hypocritical behavior all over society?
* Oh, you drunk drive? Big fine, lots of trouble. Oh, you drunk drive and are a senator, cop, mayor, ...? Well, let's look the other way.
* You have anger management issues and slam somebody to the ground? Jail time. You as a cop have anger management issues and slam somebody to the ground? Well, paid time off while we investigate and maybe a reprimand. Qualified immunity, boy!
* You commit tax fraud for 10k? Felony record, maybe jail time. You as an exec of a company commit tax fraud for 100 million? After 10 years of lawyering around, maybe you get something, maybe... oh, here is a fine of 5 million.
I am sorry, but the idea of everybody being equal under the law has always been an illusion.
We are holding China accountable for counterfeiting products because it hurts OUR companies and their income. But when it's "us vs. us", well, then it becomes a bit more messy, and in general, those with the biggest backing (as in $$$, economic value, and lawyers) tend to win.
Wait: if somebody steals my book, I can sue that person in court and get a payout (lawyers will cost me more, but that is not the point). If some AI company steals my book, well, the chance you win is close to 1%, simply because lots of well-paid lawyers will make winning hard to impossible.
Our society has always been based upon power, wealth, and influence. The more you have of them, the more you get away with (or get reduced penalties for) things that get others fined or jailed.
The unethical ones didn't buy any books.
Why is it unethical of them to use the information in all these books? They are clearly not reselling the books in any way, shape, or form. The information itself in a book can never be copyrighted. You can also publish and sell material where you quote other books within it.
break things and move fast
This is the underlying caste system coming to life right before your eyes :D.
I think caste system is the wrong analogy here.
The comment is more about the pseudo-ethical high ground.
1 reply →
Silicon Valley has always been the antithesis of ethics. Its foundations are much more right wing and libertarian, along the extremist lines.
[flagged]
If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them. I don't owe copyright to every book I read because I may subconsciously incorporate their ideas into my future work.
Are we reading the same article? The article explicitly states that it's okay to cut up and scan the books you own to train a model from them.
> I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them
The ruling would be a huge win for AI companies if held. It's really weird that you reached the opposite conclusion.
Something missed in arguments such as these is that in measuring fair use there's a consideration of impact on the potential market for a rightsholder's present and future works. In other words, can it be proven that what you are doing is meaningfully depriving the author of future income.
Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.
On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.
Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.
> a consideration of impact on the potential market for a rightsholder's present and future works
This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.
As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?
We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press, but they're just nonsensical nowadays.
10 replies →
The core problem here is that copyright already doesn't actually follow any consistent logical reasoning. "Information wants to be free" and so on. So our own evaluation of whether anything is fair use or copyrighted or infringement thereof is always going to be exclusively dictated by whatever a judge's personal take on the pile of logical contradictions is. Remember, nominally, the sole purpose of copyright is not rooted in any notions of fairness or profitability or anything. It's specifically to incentivize innovation.
So what is the right interpretation of the law with regards to how AI is using it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works to then stochastically regurgitate tiny (or not so tiny) slices of them at a time will hurt innovation elsewhere?
But obviously the real answer here is money. Copyright is powerful because monied interests want it to be. Now that copyright stands in the way of monied interests for perhaps the first time, we will see how dedicated we actually were to whatever justifications we've been seeing for DRM and copyright for the last several decades.
Everything is different at scale. I'm not giving a specific opinion on copyright here, but it just doesn't make sense when we try to apply individual rights and rules to systems of massive scale.
I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.
"Judge says training Claude on books was fair use, but piracy wasn't."
The difference here is that an LLM is a mechanical process. It may not be deterministic (at least, in a way that my brain understands determinism), but it's still a machine.
What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.
That's not a philosophical argument at odds with our current understanding of copyright law. That's exactly what this judge found copyright law currently is and it's quoted in the article being discussed.
2 replies →
It’s easy to point fingers at others. Meanwhile the top comment in this thread links to stolen content from Business Insider.
How is it stolen from Business Insider? When I visit businessinsider.com/anthropic-cut-pirated-millions-used-books-train-claude-copyright-2025-6 I get the same story. My browser caches the story, and I save it for archival purposes. How is this theft?
BI decides who can access this content and who will get the paywall. The link to archive page allows people to access this content without permission. That’s called stealing.
1 reply →
Woah woah woah I just read it I didn’t sell it to anybody
Best goddamn comment in this whole thread. Now we can have fun reading the mental gymnastics!
Buying, scanning, and discarding was part of my proposal for training under copyright restrictions.
You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improve book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.
[flagged]
That's true and was the distinction I was making. In my proposal, and maybe part of what Anthropic did, the digitized copies are used as training data for a new work, the model. That reduces the risk of legal rulings against using the copyrighted works.
From there, the cases would likely focus on whether that fits in established criteria for digitized copies, whether they're allowed in the training process itself, and the copyright status of the resulting model. Some countries allow all of that if you legally obtained the material in the first place. Also, they might factor whether it's for commercial use or not.
I'm not seeing how this is fair use in either case.
Someone correct me if I am wrong, but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?
It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model is clearly going against what copyright stands for.
Copyright is largely about distributing copies. It’s not about making something vaguely similar or about referencing copyrighted work to make something vaguely similar.
Although, there’s an exception for fictional characters:
https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...
Copyright is not on "information"; it's on the tangible expression (i.e., the actual words). "Transformative use" is a defense in copyright infringement.
Digitizing the books is the equivalent of a blind person doing something to the book to make it readable to them... the software can't read analog pages.
Learning from the book is, well, learning from the book. Yes, they intended to make money off of that learning... but then I guess a medical student reading medical textbooks intends to profit off of what they learn from them. Guess that's not fair use either (well, it's really just use, as in the intended use for all books since they were first invented).
Once a person comes to believe that copyright has any moral weight at all, I guess all rational thought becomes impossible for them. Somehow, they're not capable of entertaining the idea that copyright policy was only ever supposed to be a pragmatic thing to incentivize creative works... and that whatever little value it has disappears entirely once the policy is twisted to consolidate control.
>clearly going against what copyright stands for.
Copyright isn't a digital moat. It's largely an agreement that the work is available to the public, but the creator has a limited amount of time to exploit it at market.
If you sell an AI model, or access to an AI model, there's usually around 0% of the training data redistributed with the model. You can't decompile it and find the book. As you aren't redistributing the original work, copyright is barely relevant.
Imagine suggesting that because you own the design of a hammer, all works created with the hammer belong to you and can't be sold.
That someone came up with a new method of using books as a tool to create a different work does not entitle the original book author to a cut of the pie.
Available to the public is one thing, but a for-profit company is not "the public". They are providing a service that makes that work, in whatever form, available to the public. This seems like a middleman situation that makes a profit off of access to information, regardless of its form.
> to make a profit off of the information that is included in these works?
Isn't that what a lot of companies are doing, just through employees? I read a lot of books, and took a lot of courses, and now a company is profiting off that information.
They clearly were being digitized, but I think its a more philosophical discussion that we're only banging our heads against for the first time to say whether or not it is fair use.
Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings. Digitization is just memory. If the models cannot think then it is meaningless digital regurgitation and plagiarism, not to mention breach of copyright.
The quotes "consistent with copyright's purpose in enabling creativity and fostering scientific progress." and "Like any reader aspiring to be a writer" say, from what I can tell, that the judge has legally ruled the model can think as a human does, and therefore has the legal protections afforded to "creatives."
In my mind, there is a difference between a person using their own creative thinking to create a derivative work from learning about a subject and making money off of it, versus a corporation with a language model that is designed to absorb the works of the entire planet and redistribute that information in a way that puts them in a centralized position to become an authority on information. With a person, there is a certain responsibility to create meaning from that work so that others can experience it. For-profit companies are like machines that have no interest in the creative expression part of this process, hence the concern that they do not have the best interests of the public at heart.
> Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings.
No, that's fallacious. Using anthropomorphic words to describe a machine does not give it the same kinds of rights and affordances we give real people.
9 replies →
What do you think fair use is? The whole point of the fair use clauses is that if you transform copyrighted works enough you don't have to pay the original copyright holder.
Fair use is not, at its core, about transformation. It's about many types of uses that do not interfere with the reasons for the rights we ascribe to authors. Fair use doesn't require transformation.
There is another case where companies slurped up all of the internet and profited off the information, that makes a good comparison - search engines.
Judges consider a four-factor test when examining fair use[1]. For search engines:
1) The use is transformative, as a tool to find content is very different purpose than the content itself.
2) Nature of the original work runs the full gamut, so search engines don't get points for only consuming factual data, but it was all publicly viewable by anyone as opposed to books which require payment.
3) Search engines store significant portions of the work in their index, but only redistribute small portions.
4) Search engines, as originally devised, don't compete with the original; in fact, they can improve the potential market of the original by helping more people find it. This has changed over time, though, and search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.
So traditional search which was transformative, only republished small amounts of the originals, and didn't compete with the originals fell firmly on the side of fair use.
Google News and Books on the other hand weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.
So now lets look at LLMs:
1) LLM are absolutely transformative. Generating new text at users request is a very different purpose and character from the original works.
2) Again runs the full gamut (setting aside the clear copyright infringement downloading of illegally distributed books which is a separate issue)
3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, for which there is precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely to reproduce the originals in their outputs than search engines, there are failure cases where a model over-trained on a work and a significant amount of the original can be reproduced.
4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, whether LLMs are fair use will likely depend on how they are being used, e.g., ignore the LLM altogether and consider solely the output and whether it would be infringing if a human created it.
This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, it weighs strongly toward fair use.
I think the real problems that LLMs face will be in factors 3 and 4, which is very much context specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.
[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/
Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use, but the legal industry disagrees.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to legally interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
Most likely option is C, as we've seen this pattern many times before.
There's a lot of conflation of "should/shouldn't" and "is/isn't". The comments by tech folk you're alluding to mostly think that it "shouldn't" be fair use, out of concern about the societal consequences, whereas judges are looking at it and saying that it "is" fair use, based on the existing law.
Any reasonable reading of the current state of fair use doctrine makes it obvious that going from Harry Potter and the Sorcerer's Stone to "a computer program that outputs responses to user prompts about a variety of topics" is wildly transformative, and thus the usage of the copyrighted material is probably covered by fair use.
> Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use
Where are you getting your data from? My conclusions are the exact opposite.
(Also, aren't judges by definition the only ones qualified to declare if it is actually fair use? You could make a case that it shouldn't be fair use, but that's different from it being not fair use.)
Idk, I think most people in tech I talk to IRL think it is fair use?
I think the overly liberal, non-tech crowd has become really vocal on HN as of late and your sample is likely biased by these people.
If I allegedly train off of your training, which was trained off of copyrighted content under fair use, we're good right?
Just asking for a friend who's into this sort of thing.
Armchair commentators, including myself, tend to be imprecise when speaking about whether something is illegal, versus something should be illegal. Sometimes due to a misunderstanding of the law, or an over-estimation of the court's authority, or an over-estimation of our legislature's productivity, or just because we're making conversation and like talking.
I don't understand at all the resistance to training LLMs on any and all materials available. Then again, I've always viewed piracy as a compatible with markets and a democratizing force upon them. I thought (wrongly?) that this was the widespread progressive/leftist perspective, to err on the side of access to information.
Seeing as (a) is true in the US Supreme Court, it's probably at least as true in the lower courts.
It's not likely you've actually gotten the opinion of the "majority of tech folks", just the most outspoken ones, and only in specific bubbles you belong to.
I know for sure (b) is true. Way too many people on technical forums read legal texts as if the process to interpret laws is akin to a compiler generating a binary.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
If the AI movement manages to undermine Imaginary Property, it would redeem its externalities threefold.
I don’t think that’s gonna happen. I think they will manage to get themselves out of trouble for it, while the rest of us will still face serious problems if we are caught torrenting even one singular little book.
It's already quite widespread and likely legal for average people to train AI models on copyrighted material, in the open weight AI communities like SD and LocalLLaMa.
Please, please differentiate between pirating books (which Anthropic is liable for, and which is still illegal) and training on copyrighted material (which was found to be legal, for both corporations and average people).
Even so, would be hard to prove that this particular little book wasn't generated by Claude (oopsie, it happens to be a verbatim copy of a copyrighted work, that happens sometimes, those pesky LLMs).
1 reply →
The Ocean Full of Bowling Balls
It would be great, but I think some are worried that new AI BigTech will find a way to continue enforcing IP on the rest of society while it won't exist for them
I think that we are worried because I think that's exactly what's going to happen/ is happening.
What are your feelings about how the small fish are stripped of their art, and their years of work become just a prompt? Mainly comic artists and small musicians who are doing things they like and putting them out for people, but not for much money.
>Mainly comic artists and small musicians who are doing things they like and putting out for people, but not for much money?
The number of these artists I have seen receiving some bogus DMCA takedown notice for fan art is crazy.
I saw a bloke give away some of his STLs because he received a takedown request from Games Workshop and didn't have the funds to fight it.
It's not that I want small artists to lose; it's that I want them to gain access to every bloody copyright and trademark so they are more free to create.
Shit, Conde Nast managed to pull something like 400 pulps off the market so they didn't interfere with their newly launched James Patterson collaborations.
[flagged]
10 replies →
[flagged]
Yup.
My response to this whole thread is just “good”
Aaron Swartz is a saint and a martyr.
It's true that intellectual property is a flawed and harmful mechanism for supporting creative work, and it needs to change, but I don't think ensuring a positive outcome is as simple as this. Whether or not such a power struggle between corporate interests benefits the public rather than just some companies will be largely accidental.
I do support intellectual property reform that would be considered radical by some, as I imagine you do. But my highest hopes for this situation are more modest: if AI companies are told that their data must be in the public domain to train against, we will finally have a powerful faction among capitalists with a strong incentive to push back against the copyright monopolists when it comes to the continuous renewal of copyright terms.
If the "path of least resistance" for companies like Google, Microsoft, and Meta becomes enlarging the public domain, we might finally begin to address the stagnation of the public domain, and that could be a good thing.
But I think even such a modest hope as that one is unlikely to be realized. :-\
That would render GPL and friends redundant too... copyleft depends on copyright.
Copyleft nullifies copyright. Abolishing copyright and adding right to repair laws (mandatory source files) would give the same effect as everyone using copylefted licenses.
It will undermine it only for the rich owner of AI companies, not for everyone.
By the way, I wonder if the recent advancements in protecting YouTube videos from downloaders like yt-d*p are caused by an unwillingness to help rival AI companies gather datasets.
The buried lede here is Anthropic will need to attempt to explain to a judge that it is impossible to de-train 7M books from their models.
I'm hoping they fail, to incentivize using legal, open, and/or licensed data. Then, they might have to attempt to train a Claude-class model on legal data. Then, I'll have a great, legal model to use. :)
How come? They just need to delete the model and train a new one without those books.
Or they could be forced to settle a price for access to the books.
Anyone read the 2006 sci-fi book Rainbows End that has this? It was set in 2025.
I was 100% thinking this. GREAT book. And they, too, shredded books to ingest them into the digital library! I don't recall if it was an attempt to bypass copyright though; in Rainbows End, it was more technical, as it was easier to shred, scan the pieces, and reassemble them in software, rather than scanning each page.
actual title:
"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Yeah, I'm not sure if people realize that the whole reason they had to cut up the books was because they wanted to comply with copyright law. Artificial scarcity.
The importance of acquiring the physical book was the transfer of compensation to the author.
You're not wrong, but that's one heck of a way to do it. It involves the destruction of 7 million books, which ... I really don't quite see the "promotion of Progress of Science and useful Arts" in that.
If ingesting books into an AI makes Anthropic criminals, then Google et al are also criminals alike for making search indexes of the Internet. Anything published online is equally copyrighted.
Exactly! If Anthropic is guilty of copyright infringement for the mere act of downloading copyrighted books then so is Google, Microsoft (Bing), DuckDuckGo, etc. Every search engine that exists downloads pirated material every day. They'd all be guilty.
Not only that but all of us are guilty too because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.
Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.
Intent can guide a judge when they determine damages but that's about it.
Yeah, we can all agree that ingesting books is fair use and transformative, but you gotta own what you ingest, you can't just pirate it.
I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.
I disagree that it has anything to do with copyright. It is at most theft. If I steal a bunch of books from the library, I haven't committed any breach of copyright.
Anyone else think destroying books for any reason is wrong?
Or is that perhaps not a universal cultural/moral stance?
I'd guess that in Europe, for example, people could be more sensitive to it.
If they aren't one of a kind and they digitally preserved them in some way, I think I'd be OK with it.
That said, there are tools for digitizing books that don't require destroying them.
There's nothing sacred about books. There are plenty of books that won't be missed if destroyed.
I have purposefully destroyed one book in my life, in order to prevent anyone from reading it:
Man of Two Worlds by Brian Herbert.
...and I did the world a favor.
Just looked this up and now I might read it, ostensibly because of the fact that you destroyed it.
1 reply →
Order on Fair Use
https://ia800101.us.archive.org/15/items/gov.uscourts.cand.4...
Something I've been trying to reconcile: I buy a cheap used book on Biblio and I'm morally OK, even though the writer doesn't get paid. But if I pirate the book, then I'm wrong for that because the writer doesn't get paid?
Based on the fact that people went to jail for downloading some music or movies, this guy will face a lifetime in prison for 7 million books that he then used for commercial profit, right?
Right, guys? We don't have rules for thee but not for me in the land of the free?
Meta did the same, and probably other big companies too. People who praise AGI are very short-sighted. It will ruin the world under our current morals and ethics. It's like a nuclear weapon in the hands of barbarians (shit, we have that too, actually).
The main problem I have with that argument is: when have we ever been "ready" for any technology ahead of time? A technology's impact is always unknown, not only on society but within the technology itself. What sort of "preparatory" work could a society even do? If we applied such "precautionary" logic to automobiles, at best it would mean a large public works project to build separated grades for all vehicles before a single vehicle was built, with no idea how tight the constraints would truly be (assuming cars would only ever go as fast as a horse at a gallop). Far more likely, it winds up a silly waste of time, or a pretext for Luddism trying to delay the inevitable. Like it or not, being capable of doing something is the only true test of being ready for it.
Not to mention that the whole notion of being able to judge others as "not ready for it" is an insult to the very idea of individual self-determination. Imagine, for instance, if the Western world in the past had taken after Starfleet in the worst of ways and banned supplying medical aid to sub-Saharan Africa because it judged those societies as not ready for it. They would rightfully be called callous, arrogant, racist imperialists for thinking it their right to impose suffering upon others and deny them opportunities and self-determination because of their own parochial judgement, and for thinking they knew better! Putting oneself in the position to judge for the world is an act of hubris far greater than anything they project onto the attempted inventors of AGI.
So using the standard industry metrics for calculating the financial impact of piracy, this would equate to something like trillions of damages to the book publishing industry?
Two week old news.
Some previous discussions:
https://news.ycombinator.com/item?id=44381639
If AI companies are allowed to use pirated material to create their products, does that mean everyone can use pirated software to create products? Where is the line?
Also, please don't use the word "learning"; use "creating software using copyrighted materials".
And let's think together about how we can prevent AI companies from using our work through technical measures, if the law doesn't work.
But the AI used the content to learn how to copy and recreate it. Is ‘re-creation’ a better concept for us?
People already use pirated software for product creation.
Hypothetical:
I know a guy who learned photoshop on a pirated copy of Photoshop. He went on to be a graphic designer. All his earnings are ‘proceeds from crime’
He never used the pirated software to produce content.
So can we officially download pirated content to learn stuff now?
3 replies →
It's abusive and wrong to try and prevent AI companies from using your works at all.
The whole point of copyright is to ensure you're paid for your work. AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
If that LLM reproduces your work, then the AI company is violating copyright; but if the LLM doesn't reproduce your work, then you have not been harmed. Claiming harm when there is none, over a philosophical difference of opinion with the AI company, is an abuse of the courts.
> It's abusive and wrong to try and prevent AI companies from using your works at all.
People don't view moral issues in the abstract.
A better perspective on this: human individuals have created works which megacorps are training on for free, or for the price of a single book, creating models which replace those individuals.
The megacorps are only partially replacing individuals now, but when the models get good enough they could replace humans entirely.
When such a future happens will you still be siding with them or with individual creators?
2 replies →
> The whole point of copyright is to ensure you're paid for your work.
No. The point of copyright is that the author gets to decide under what terms their works are copied. That's the essence of copyright. In many cases, authors will happily sell you a copy of their work, but they're under no obligation to do so. They can claim a copyright and then never release their work to the general public. That's perfectly within their rights, and they can sue to stop anybody from distributing copies.
4 replies →
Current copyright law is not remotely sophisticated enough to make determinations on AI fair use. Whether the courts say current AI use is fair is irrelevant to the point most people on this side would agree with: we need new laws. The work the AI companies stole to train on was created under a copyright regime where the expectation was that, eh, a few people would learn from and be inspired by your work, and that feels great because you're empowering other humans. Scale does not amplify Good. The regime has changed. The expectations about what kinds of use copyright protects against have fundamentally changed. The AI companies invented New Horrors that no one could have predicted, Vader altered the deal, and no reasonable artist except the most forward-thinking sci-fi authors would have remotely guessed what their work would be used for, and thus could never have consciously and fairly agreed to this exchange. Very few would have agreed to it.
It is not wrong at all. The author decides what to do with their work. AI companies are rich and can simply buy the rights or hire people to create works.
I could agree with exceptions for non-commercial activity like scientific research, but AI companies are made for extracting profits and not for doing research.
> AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
It doesn't work this way. If you buy a movie, that doesn't mean you can sell goods featuring the movie's characters.
> then you have not been harmed.
I am harmed, because fewer people will buy the book if they can simply get an answer from an LLM. Fewer people will hire me to write code if an LLM trained on my code can do it. Maybe instead of books we should start making applications that protect the content and do not allow copying text or taking screenshots. And instead of open-source code we should provide binary WASM modules.
7 replies →
~1B USD in cash is the line where laws apply very differently
Especially in the US
Where are you reading that?
You are allowed to buy and scan books, and then use those scanned books to create products. I guess you're also allowed to pirate books and use the knowledge to create products, if you're willing to pay the rights holders damages for the copyright violations.
When I was young and poor I learned on pirated software. Do I owe Adobe, Microsoft and others a percentage of my today income?
Ask them?
Let’s say my AI company is training an AI on woodworking books and at the end, it will describe in text and wireframe drawings (but not the original or identical photos) how to do a particular task.
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
Copyright doesn’t cover facts and methods. It specifically covers creative expressions. That’s why patents are different from copyright. If you read some woodworking books and then write your own online tutorial about building a chair using the methods and procedures described in that book, it doesn’t matter that you now compete with the books that you used, provided you didn’t copy the creative elements. How much of a chair design is creative and how much is function is an ambiguous question that might still land you in court, but it won’t be over your right to make the tutorial in the first place.
As the judge noted in this ruling, copyright isn't intended to protect authors from competition. Copyright doesn't protect Rowling from other authors writing YA wizard books that cut into her revenue streams, or from TV producers making YA wizard shows that reduce the demand for books. Copyright doesn't protect the Tolkien estate from Terry Brooks, or Tracy Hickman or Margaret Weis, reducing the demand for Tolkien fantasy by supplanting it with their own fantasies.
The same argument applies to someone who learned from the book and wrote an article explaining the idea to someone else.
If you paid a human author to do the same you’d be breaking no law. Learning is the point of books existing in the first place.
Humans learning, not machines learning is the point of books.
The farce of treating a corporation as an individual precludes the common-sense legal procedure of investigating the people responsible for criminal action taken by the company. It's obviously premeditated, and in every way an illicit act knowingly perpetrated by persons. The only discourse should be about upending this penthouse legalism.
The “farce” of treating a corporation as a legal individual is the reason you can have this case in the first place. Otherwise the authors would have had to discover and individually sue each specific individual in the company for each specific claim. They would have to find the specific individual that downloaded their specific book and sue that person. Then they would need to find the specific individual that digitized their specific book and sue that person. Then they would need to find the specific person that loaded that digital copy into an AI model and sue that person. And on and on for each alleged act of infringement.
Or we could recognize that’s silly when we’re talking about a group of people acting in concert and treat them as a single entity for the purpose of alleged crimes. Which is what we do when we treat a corporation as an individual for legal purposes.
The irony is that actually litigating copyright law would lead to its repeal, as with all backwater laws that are used to "protect the interests" of "corporations" yet criminalize petty individual cases.
This, of course, cannot be allowed to happen, so the legal system is just a limbo: a bar which regular individuals must strain to pass under, but which corporations regularly overstep.
They've all done that, it should be obvious by now. Training on just freely available data only gets you so far.
I've begun to wonder if this is why some large torrent sites haven't been taken down. They are essentially able to crowdsource all the work. There are some users who spend ungodly amounts of time and money on these sites that I suspect are rich industry benefactors.
If Anthropic is funded by Amazon, they should have just asked Amazon for unlimited downloads of EVERY book in the Amazon book store, and all the audiobooks as well. It certainly would have been faster than buying one copy of each and tearing it apart.
Hang on, it is OK under copyright law to scan a book I bought second hand, destroy the hard copy and keep the scan in my online library? That doesn't seem to chime with the copyright notices I have read in books.
First sale doctrine gives the person who sold the book you bought the right to sell it to you. Fair Use permits you to scan your copy, used or new. It's your book, you can destroy it. But you have to delete your digital copy if you sell it or give it away. And you can't distribute your digital copy.
Fair use can be a pretty gray area and details matter, but copying for personal use is frequently okay.
> That doesn't seem to chime with the copyright notices I have read in books.
You shouldn't get your legal advice from someone with skin in the game.
> That doesn't seem to chime with the copyright notices I have read in books.
I used to get scared by such verbiage. Courts ruled decades ago that many of those uses are actually permitted, under very common conditions (e.g. not distributing, etc). Yes, you totally can photocopy a book you own, for your own purposes.
The article doesn't say who is suing them. Is it a class action? How many of these 7M pirated books have they written? Is it publishing houses? How many of these books are relevant in this judgement?
As far as I understand it, training on books is clearly not fair use (as the result will likely hurt the livelihood of authors, especially those who aren't "best of the best" authors).
As long as you buy the book, it still should be legal; that is, if you actually buy the book and not a "read only" eBook.
But the 7_000_000 pirated books are a huge issue, and one we have a lot of reason to believe isn't specific to Anthropic.
Buying a copy of a book does not give you license to take the exact content of that book, repackage it as a web service, and sell it to millions of others. That's called theft.
> "Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different," he wrote.
But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that. And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material. Also people do not read millions of books to become a writer.
> But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that.
The analogy refers to humans using machines to do what would already be legal if they did it manually.
> And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material.
[Citation needed], and not a legal argument.
> Also people do not read millions of books to become a writer.
But people do hear millions of words as children.
> But people do hear millions of words as children.
At a rate of 1,000 words/day it takes about 3 years to hear a million words. Also, "a million words" is not equal to "a million books". Humans are ridiculously efficient learners compared to LLMs.
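The rate arithmetic above can be sanity-checked directly; a minimal sketch, where the 1,000 words/day figure is the commenter's assumption rather than a measured number:

```python
# Back-of-envelope check of the parent comment's numbers.
# 1,000 words/day is the commenter's assumed rate, not a measured figure.
words_per_day = 1_000
days = 1_000_000 // words_per_day   # days to hear a million words
years = days / 365                  # convert to years
print(days, round(years, 1))        # 1000 days, roughly 2.7 (i.e. ~3) years
```

So "3 years" is a slight round-up of roughly 2.7 years under that assumed rate.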
It is shocking how courts have been ruling in favor of AI companies despite the obvious problem of allowing automated plagiarism.
Information wants to be free
Then why do they sell their services instead of putting the model in open source?
Not really, plagiarism is not a legal concept.
When Aaron Swartz did it, he ended up dying.
It’s marginally better than Meta torrenting z-lib.
I'm curious - do the people here who think copyright shouldn't exist also think trademark shouldn't exist?
1980s: Johnny 5 need input!
2020s: (Steals a bunch of books to profit off the acquired knowledge.)
The title is clearly meant to generate outrage, but what is wrong with cutting up a book that you own?
[flagged]
They very clearly had a reason.
poverty mindset. We can make more books, and now these copies contribute to a corpus of knowledge that far more people benefit from
3 replies →
Two of the top AI companies flouted ethics with regard to training data. In OpenAI's case, the whistleblower probably got whacked for exposing it.
Can anyone make a compelling argument that any of these AI companies have the public's best interest in mind (alignment/superalignment)?
[dead]
under the DMCA the minimum penalty for an illegally downloaded file is $750 (https://copyrightresource.uw.edu/copyright-law/dmca/)
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum fine for each instance is $750, then the law says that Anthropic is liable for $5.25 billion. I just want it out there that they definitely broke the law, and that the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens, we at least can't pretend we didn't know.
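The multiplication above checks out; a minimal sketch, where the $750-per-work minimum is the statutory-damages floor the commenter cites and the 7 million count comes from the article, not from the ruling itself:

```python
# The parent comment's statutory-damages arithmetic.
# $750/work (the cited statutory minimum) and 7,000,000 works (the
# article's pirated-book count) are the thread's figures, not the court's.
works = 7_000_000
minimum_per_work = 750              # USD, minimum per infringed work
total = works * minimum_per_work
print(f"${total:,}")                # $5,250,000,000, i.e. $5.25 billion
```

Whether each downloaded file counts as a separately infringed "work" is exactly the kind of question the damages phase would decide, so this is a floor under the commenter's assumptions, not a prediction.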
Should have listened to those NordVPN ads on YouTube
Hopefully they were all good books at least.
they pirated the best ones, according to the authors
So, how should we as a society handle this?
Ensure the models are open source, so everyone can use them, as everyones data is in there?
Close those companies and force them to delete the models, as they used copyright material?
No one said anything when Google did it, way before LLMs were a thing.
So if you incorporate you can do whatever you want without criminal charges?
Most of the comments missed the point. It's not that they trained on books, it's that they pirated the books.
seems like the "mis" is missing from the name.
The solution has always been: show us the training data.
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to start making claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt[1]: "Steal everything, then if you get big, let the lawyers figure it out." This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world; some of the, if not the, richest companies to ever exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion-dollar company, 2020 when it became the first two-trillion-dollar company, and 2022 when it became the first three-trillion-dollar company. Now we have 10 companies north of the trillion-dollar mark![3] (5 above $2T and 3 above $3T) These values have exploded in the last 5 years! It feels difficult to argue that we don't have enough money to do things better, or at least to not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker deals for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement), and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
From Vinge's "Rainbows End":
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazine flew down the tunnel like leaves in tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
Yes, I was thinking of this passage as well. The technology does not seem to have advanced to this particular point yet.
Good, this is what Aaron Swartz was fighting for.
Against companies like Elsevier locking up the world's knowledge.
Authors are no different from scientists; many had government funding at one point, and it's the publishing companies that got most of the sales.
You can disagree and think Aaron Swartz was evil, but you can't have both.
You can take what Anthropic has shown you is possible and do this yourself now.
isohunt: freedom of information
Maybe to give something back to the pirates, Anthropic could upload all the books they have digitized to the archive? /s
Everybody that wants to train an LLM, should buy every single book, every single issue of a magazine or a newspaper, and personally ask every person that ever left a comment on social media. /s
If I was China I would buy every lawyer to drown western AI companies in lawsuits, because it's an easy way to win AI race.
I will never feel bad again for learning from copied books /S
[flagged]
He downloaded millions of academic articles and the government charged him with multiple felonies.
The difference is, Aaron Swartz wasn't planning to build massive datacenters with expensive Nvidia servers all over the world.
>the government charged him with multiple felonies.
This was the result of a cruel and zealous overreach by the prosecutor to try to advance her political career. It should never have gone that far.
The failure of MIT to rally in support of Aaron will never be forgiven.
1 reply →
It's even worse considering that all he downloaded was in the public domain, so it was much less problematic from a copyright standpoint.
The lesson is simple: if you want to break a law, make sure it is very profitable, because then you can find investors and get away with it. If you play Robin Hood, you will be met with a hammer.
[flagged]
Make sure you have a few billion dollars ready so you can pay a few million on the lawsuits. A volcano getting a cup of water poured into it.
Amazon has been doing this since the 2000s. Fun fact: this is how AWS came about; Amazon built it to scale the "LOOK INSIDE!" feature for all the books it was hoovering up in an attempt to kill the last advantage bookstores had over it.
I.e., this is not a big deal. The only difference now is that people are rapidly frothing to be outraged at the mere sniff of new tech on the horizon. Overton window in effect.