Comment by palata

1 day ago

Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.

Sometimes, LLMs actually generate copyright headers in their output - lol - like in this PR, which was the subject of a recent HN post [1]

https://news.ycombinator.com/item?id=46039274

  • I once had a well-known LLM reproduce pretty much an entire file from a well-known React library verbatim.

    I was writing code in an unrelated programming language at the time, and the bizarre inclusion of that particular file in the output was presumably because the name of the library was very similar to a keyword I was using in my existing code, but this experience did not fill me with confidence about the abilities of contemporary AI. ;-)

    However, it did clearly demonstrate that LLMs with billions or even trillions of parameters certainly can embed enough information to reproduce some of the material they were trained on verbatim or very close to it.

  • So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.

    • If you have seen, for example, the Windows source code, you cannot take certain jobs implementing Windows-compatible interfaces that are supposed to be free from Microsoft IP. One could say your brain has been "infected". The same is true of many things around intellectual property.

    • There is a stupid presupposition that LLMs are equivalent to human brains, which they clearly are not. Stateless token generators are OBVIOUSLY not like human brains, even if you somehow contort the definition of intelligence to include them.


    • > Doesn't mean my brain is GPLed.

      It would be if they could get away with it. The likes of Disney would delete your memories of their films if they could get away with it. If you want to enjoy the film, you should have to pay them for the privilege, not recall the last time you watched it.


    • not your brain, but the code you produce if it includes portions of GPL code that you remembered.

    • > So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.

      Your brain is part of you. Some might say it is your very essence. You are human. Humans have inalienable rights that sometimes trump those enshrined by copyright. One such right is the right to remember things you've read. LLMs are not human, and thus don't enjoy such rights.

      Moreover, your brain is not distributed to other people. It's more like a storage medium than a distribution. There is a lot less furore about LLMs that are just storage media, where neither they themselves nor their outputs are distributed. They're obviously not very useful.

      So your analogy is poor.


> Genuine question: if I train my model with copyleft material, how do you prove I did?

An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?

In other words, even if your model was trained strictly on copyleft material but, when properly prompted, outputs a copyrighted work, is that copyright infringement, and if so, by whom?

Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots." Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?

  • > even if your model was trained strictly on copyleft material

    That's not legal use of the material according to most copyleft licenses. Regardless if you end up trying to reproduce it. It's also quite immoral if technically-strictly-speaking-maybe-not-unlawful.

    • > That's not legal use of the material according to most copyleft licenses.

      That probably doesn't matter given the current rulings that training an AI model on otherwise legally acquired material is "fair use", because the copyleft license inherently only has power because of copyright.

      I'm sure at some point we'll see litigation over a case where someone attempts to make "not using the material to train AI" a term of the sales contract for something, but my guess would be that if that went anywhere it would be on the back of contract law, not copyright law.


    • I was referencing words in the comment I was replying to; you can safely substitute "copyleft" with "public domain" and the argument still stands. Your comment's focus on the minutiae of training, however, highlights how relevant the discussion around outputs in particular is.

      edit: wording.

> Genuine question: if I train my model with copyleft material, how do you prove I did?

It may produce it when asked

https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...

  • > It may produce it when asked

    that's not proof - it may also be intelligent enough to have produced similar expressions without the original training data.

    Not to mention that having knowledge of copyrighted material is not in violation of any known copyright law - after all, human brains also have the knowledge after learning it. The model, therefore, cannot be in violation regardless of what data was used to train it (as long as that data was not obtained illegally).

    If someone _chooses_ to use the LLM to reproduce Harry Potter, or some GPL'ed code, then that person would be in violation of the relevant copyright laws. The copyright owner needs to pursue that person, rather than the owner of the LLM. In the exact same way, if someone used Microsoft Word to reproduce Harry Potter, Microsoft would not have any liability.

> Genuine question: if I train my model with copyleft material, how do you prove I did?

discovery via lawyers

I've thought about this as well, especially for the case where it's a company-owned product that is AGPLed. It's a really tough situation, because the last thing we want is competitors coming in and LLM-washing our code to benefit their own product. I think this is a real risk.

On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.

At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee to the users freedom is more important to me than a theoretical threat. The one exception is anything that is truly a risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.

I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.

You need low-level access to the AI in question, and a lot of compute, but for most AI types you can infer whether a given data fragment was in the training set.

It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.

Now, would that be enough to put the entire AI under GPL? I doubt it.
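The membership-inference idea described above can be sketched with a toy model: examples that were in the training data tend to score a noticeably lower per-token loss than unseen text, and that gap is the signal. The bigram "model", corpus, and texts below are all hypothetical stand-ins for a real LLM and its training set.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Count character-bigram frequencies over a training corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return counts

def avg_nll(counts, text, vocab_size=128):
    """Add-one-smoothed average negative log-likelihood per bigram."""
    nll = 0.0
    pairs = list(zip(text, text[1:]))
    for a, b in pairs:
        total = sum(counts[a].values())
        p = (counts[a][b] + 1) / (total + vocab_size)
        nll -= math.log(p)
    return nll / max(len(pairs), 1)

# Toy "training set": repeated text stands in for data seen many times.
corpus = ["the quick brown fox jumps over the lazy dog"] * 5
model = train_bigram(corpus)

member = "the quick brown fox"   # was in the training data
non_member = "zq xv kj wp mb"    # was not

# Members score a lower loss -- the basis of a membership-inference test.
print(avg_nll(model, member) < avg_nll(model, non_member))  # True
```

Real membership-inference attacks on LLMs follow the same shape - compare the model's loss on a candidate fragment against a calibrated threshold - but, as the comment notes, they need model access and compute, and they work best on data that was repeated many times in training.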

By reverse inference and model inversion. We can determine what content a pathway has been trained on. We can find out if it’s been trained on GPL material.

It's why I stopped contributing to open source work. It's pretty clear that, in the age of LLMs, this breach of the license under which the code is written will be allowed to continue, and that open source code will be turned into commercial products.

> Genuine question: if I train my model with copyleft material, how do you prove I did?

Discovery.

There's the other side of this issue. The current position of the U.S. Copyright Office is that AI output is not copyrightable, because the Constitution's copyright clause only protects human authors. This is consistent with the US position that databases and lists are not copyrightable.[1]

Trump is trying to fire the head of the U.S. Copyright Office, but they work for the Library of Congress, not the executive branch, so that didn't work.[2]

[1] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

[2] https://apnews.com/article/trump-supreme-court-copyright-off...

> Should I keep open sourcing my code now that the licence doesn't matter anymore?

your LICENSE matters in similar ways to how it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but not in all cases at all times. Do not despair!

genuine question: why are you training your model on content whose license requirements will explicitly be violated if you do?