Comment by asadotzler

6 months ago

OpenAI doesn't respect copyright so why would they let a verbal agreement get in the way of billion$

Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

  • Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. E.g. https://www.reuters.com/legal/litigation/openai-gets-partial...

    Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.

    In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. E.g. Reddit locking down API access and selling their data to Google.

    • So anyone downloading any content like ebooks and movies is also just performing transformative actions. Forming memories, nothing else. Fair use.

  • The FSF funded some white papers a while ago on CoPilot: https://www.fsf.org/news/publication-of-the-fsf-funded-white.... Take a look at the analysis by two academics versed in law at https://www.fsf.org/licensing/copilot/copyright-implications... starting with §II.B that explains why it might be legal.

    Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...) but then again he studied CS, not law. Nor has the FSF attempted AFAIK to file any suits even though they likely would have if it were an open and shut case.

    • All of the most capable models I use have been clearly trained on the entirety of libgen/z-lib. You know it is the first thing they did, it is like 100TB.

      Some of the models are even coy about it.

  • A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.

    However, while it isn't fully settled yet, at the moment it does not appear to be the case.

    • A lot of people have a problem with the selective enforcement of copyright law. Yes, changing the law because it has been captured by greedy corporations is something many would welcome. But currently the problem is that normal folks doing what OpenAI is doing would be crushed (metaphorically) under the current copyright law.

      So it is not as if everyone who has a problem with OpenAI is wielding a big cudgel. Also, OpenAI is making money (well, not making a profit is their issue) from the copyrighted work of others without compensation. Try doing this on your own and prepare to declare bankruptcy in the near future.

    • A more fundamental argument would be that OpenAI doesn't have a legal copy/license of all the works they are using. They are, for instance, obviously training on internet comments, which are copyrighted and which, I assume, are not all legally licensed from the site owners (who usually have legalese in their terms of posting granting them a super-license to comments) or from the posters who made them. I'm also curious whether they've bothered to get legal copies of, or licenses to, all the books they are using rather than just grabbing LibGen or whatever. The time commitment to track down a legal copy of every copyrighted work there would be quite significant even for a billion-dollar company.

      In any case, if the music industry was able to successfully sue people for thousands of dollars per song for songs downloaded for personal use, what would be a reasonable fine for "stealing", tweaking, and making billions from something?

  • "When I was a kid, I was praying to a god for bicycle. But then I realized that god doesn't work this way, so I stole a bicycle and prayed to a god for forgiveness." (c)

    Basically a heist too big and too fast to react to. Now every impotent lawmaker in the world is afraid to call them what they are, because doing so would bring down on them the wrath of both the other IT corpos and regular users, who will refuse to part with a toy they now feel entitled to.

  • Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.

    As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

    If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.

    If I coax your copyrighted work out of my phone's keyboard suggestion engine letter by letter, and publish it, it's still me infringing on your copyright, not Apple.

    If I make a copy of your clip art in Illustrator, is Adobe responsible? Etc.

    Even if (as I've seen argued ad nauseam) a model was trained on copyrighted works from a piracy website, the copyright holder's tort would be with the source of the infringing distribution, not with the people who read the material.

    Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?

    • > the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

      Someone who just reads the material doesn't infringe. But someone who copies it, or prepares works that are derivative of it (which can happen even if they don't copy a single word or phrase literally), does.

      > would I then owe the authors of the books I learned from a fee to apply that knowledge?

      Facts can't be copyrighted, so applying the facts you learned is free, but creative works are generally copyrighted. If you write your own book inspired by a book you read, that can be copyright infringement (see The Wind Done Gone). If you use even a tiny fragment of someone else's work in your own, even if not consciously, that can be copyright infringement (see My Sweet Lord).

    • > As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

      Where this breaks down, though, is that contributory infringement is still a thing if you offer a service that aids in copyright infringement and you don't do "enough" to stop it.

      I.e., it would all be on the end user for folks that self-host or rent hardware and run an LLM or generative art model themselves. But folks that offer a consumer-level, end-to-end service like ChatGPT or MidJourney could be on the hook.

  • > Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

    Uber showed the way. They initially operated illegally in many cities but moved so quickly as to capture the market and then they would tell the city that they need to be worked with because people love their service.

    https://www.theguardian.com/news/2022/jul/10/uber-files-leak...

  • The short answer is that there are actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since the big generative AI blow-up, fueled by entities with deep pockets (i.e., you can actually profit off of the lawsuit), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.

    And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted training content, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of AI training being found fair use is actually quite slim.

  • You'll find people on this forum especially making the false analogy with a human: these things are supposedly like, or analogous to, human minds, and human minds have fair-use access, so why shouldn't these?

    Magical thinking that just so happens to make lots of $$. And after all why would you want to get in the way of profit^H^H^Hgress?

  • I wonder if Google can sue them for downloading the YouTube videos plus automatically generated transcripts in order to train their models.

    And if Google could enforce removal of this content from their training set and enforce a "rebuild" of a model which does not contain this data.

    Billion-dollar lawsuits.

  • “There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.”

  • It's because copyright is fake, and the only thing supporting it was million-dollar businesses. It naturally crumbles when facing billion-dollar businesses.

Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.

I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.