The current state of the theory that GPL propagates to AI models

1 day ago (shujisado.org)

Great article but I don't really agree with their take on GPL regarding this paragraph:

> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that they are proceeding in a different vector from the direction of code sharing idealized by GPL. If only the theory of GPL propagation to models walks alone, in reality, only data exclusion and closing off to avoid litigation risks will progress, and there is a fear that it will not lead to the expansion of free software culture.

The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct to ensure the software is not stolen from its users. If you just want your code to be shared and used without restrictions, use MIT or some other license.

> What is important is how to realize the “freedom of software,” which is the philosophy of open source

Freedom of software means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article.)

I also don't agree with the argument that since a lot of things are included in the model, the GPL code is only a small part of the whole, and that means it's okay. Well, if I take one GPL function and include it in my project, no matter its size, I would have to license it as GPL. Where is the line? Why would my software which only contains a single function not be fair use?

  • There are many misconceptions about the GPL, GNU, and the free software movement. I love the idealism of free software and you hit the nail on the head.

    Below are the four freedoms for those who are interested. Straight from the horse's mouth: https://www.gnu.org/philosophy/free-sw.html

        The freedom to run the program as you wish, for any purpose (freedom 0).
    
        The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
    
        The freedom to redistribute copies so you can help others (freedom 2).
    
        The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

  • > The spirit of the GPL is the freedom of the user, not the code being freely shared.

    who do you mean by "user"?

    the spirit is that the person who actually uses the software also has the freedom to modify it, and that the users receiving these modifications have the same rights.

    is that what you meant?

    and while technically that's the spirit of the GPL, the license is not only about users, but about a _relationship_, that of the user and the software and what the user is allowed to do with the software.

    it thus makes sense to talk about "software freedom".

    last but not least, about a single GPL function --- many GPL _libraries_ are licensed under the less restrictive LGPL.

    • I don't think you understand the GPL.

      > "the user is allowed to do with the software"

      The GPL does not restrict what the user does with the software.

      It can be USED for anything.

      But it does restrict how you redistribute it. You have responsibilities if you redistribute it. You must provide the source code, and pass on the same freedoms you received to the users you redistribute it to.


  • > The virality is a byproduct to ensure the software is not stolen from its users.

    If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.

    > Freedom of software means nothing.

    Software is information. Does "freedom of information" mean nothing? I think you're narrowing concepts here into something not particularly useful or reflective of reality.

    > Users get the freedom to enjoy the software how they like.

    The freedom is to modify the code for my own purposes. This is not at all required to plainly "enjoy" the software. I instead "enjoy a particular benefit."

    > Why would my software which only contains a single function not be fair use?

      Because fair use implies educational, informational, or transformative outputs. Your software is none of those things.

    • "If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way."

      Yes you are. You are just deprived of something you apparently don't recognize or value, but that doesn't make it ok.

      The original author was also stolen from and that doesn't rely on your understanding or perception.

      The original author set some terms. The terms were not money, but they are terms exactly like money. They said: "you can have this, and the only price is that you have to make the source, and the further right to redistribute, available to any user you hand a binary to."

      Well MS handed you a binary and did not also hand you the source or the right to redistribute.

      That stole from you, and from the original author, and from me, who might otherwise have benefited from your own derivative work. The fact that you personally were apparently never going to make use of something they owe you doesn't change the fact that they owe you, and the original author, and me.


    • > If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.

      The user in this example is deprived of freedoms 1, 2, and 3 (and probably freedom 0 as well if there are terms on what machines you can run the derivative binary on).

      Read more here: https://www.gnu.org/philosophy/free-sw.html

      Whether or not the user values these freedoms is another thing entirely. As the software author, licensing your code under the GPL is making a conscious effort to ensure that your software is and always will be free (as in freedom, not just as in beer) software.

  • The GPL arose from Stallman's frustration at not having access to the source code for a printer driver that was causing him grief.

    In a world where he could have just said "Please create a PDP-whatever driver for an IBM-whatever printer," there never would have been a GPL. In that sense AI represents the fulfillment of his vision, not a refutation or violation.

    I'd be surprised if he saw it that way, of course.

    • The safeguards will prevent the AI from reproducing the proprietary drivers for the IBM-whatever printer, and it will not provide code that breaks the DRM that exists to prevent third-party drivers from working with the printer. There will, however, be no such safeguards or filters to prevent IBM from writing a proprietary driver for their next printer, using existing GPL drivers as a building block.

      Code will only ever go in one direction here.


    • But that isn't the same code that you were running before. And like, let's not forget GPLv3: "please give me the code for a mobile OS that could run on an iPhone" does not in any way help me modify the code running on MY iPhone.


    • The only legal way to do that in the proprietary software world is a clean room implementation.

      An AI could never do a clean room implementation of anything, since it was not trained on clean room materials alone. And it never can be, for obvious reasons. I don't think there's an easy way out here.

    • In said hypothetical world, though, the whatever-driver would also have been written by LLMs; and, if the printer or whatever is non-trivial and made by a typical large company, by many LLM instances with a sizable amount of token spending over a long period of time.

      So getting your own LLM rewrite to an equivalent point (or, rather, less buggy as that's the whole point!) would be rather expensive; at the absolute very least, certainly more expensive than if you still had the original source code to reference or modify (even if an LLM is the thing doing those). Having the original source code is still just strictly unconditionally better.

      Never mind the question of how you even get your LLM to reverse-engineer & interact with & observe the physical hardware of your printer, and all the ink wasted while debugging the reinvention of what the original driver already did correctly.


Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.

  • Sometimes, LLMs actually generate copyright headers as well in their output - lol - like in this PR, which was the subject of a recent HN post [1]

    https://news.ycombinator.com/item?id=46039274

    • I once had a well-known LLM reproduce pretty much an entire file from a well-known React library verbatim.

      I was writing code in an unrelated programming language at the time, and the bizarre inclusion of that particular file in the output was presumably because the name of the library was very similar to a keyword I was using in my existing code, but this experience did not fill me with confidence about the abilities of contemporary AI. ;-)

      However, it did clearly demonstrate that LLMs with billions or even trillions of parameters certainly can embed enough information to reproduce some of the material they were trained on verbatim or very close to it.

  • > Genuine question: if I train my model with copyleft material, how do you prove I did?

    An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?

    In other words, even if your model was trained strictly on copyleft material but, when properly prompted, outputs a copyrighted work, is it copyright infringement, and if so, by whom?

    Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots." Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?
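
    For the code side of the question, the crude test is the reverse of training: fingerprint the output against a corpus of known-licensed work. A minimal sketch in Python (the corpus directory and window length are invented for illustration; real duplication filters are far more sophisticated):

        # Toy overlap check: flag generated text that shares long token
        # windows with a reference corpus. Paths and window length invented.
        from pathlib import Path

        N = 12  # tokens per window; longer windows mean fewer false positives

        def shingles(text, n=N):
            toks = text.split()
            return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

        corpus = set()
        for f in Path("reference_corpus").rglob("*.c"):
            corpus |= shingles(f.read_text(errors="ignore"))

        def looks_copied(generated):
            return bool(shingles(generated) & corpus)

    The catch is that a match only proves presence; absence proves nothing, and no such filter can tell the cartoon mouse apart from the copyrighted character at all.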

    • > even if your model was trained strictly on copyleft material

      That's not legal use of the material according to most copyleft licenses, regardless of whether you end up trying to reproduce it. It's also quite immoral, even if technically-strictly-speaking-maybe-not-unlawful.


  • > Genuine question: if I train my model with copyleft material, how do you prove I did?

    It may produce it when asked

    https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...

    • > It may produce it when asked

      that's not proof - it may also be intelligent enough to have produced similar expressions without the original training data.

      Not to mention that having knowledge of copyrighted material is not in violation of any known copyright law - after all, human brains also have the knowledge after learning it. The model, therefore, cannot be in violation regardless of what data was used to train it (as long as that data was not obtained illegally).

      If someone _chooses_ to use the LLM to reproduce Harry Potter, or some GPL'd code, then that person would be in violation of the relevant copyright laws. The copyright owner needs to pursue that person, rather than the owner of the LLM. In the exact same way, if someone used Microsoft Word to reproduce Harry Potter, Microsoft would not have any liability.

  • > Genuine question: if I train my model with copyleft material, how do you prove I did?

    discovery via lawyers

  • I've thought about this as well, especially for the case when it's a company-owned product that is AGPLed. It's a really tough situation, because the last thing we want is competitors coming in and LLM-washing our code to benefit their own product. I think this is a real risk.

    On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.

    At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee of the users' freedom are more important to me than a theoretical threat. The one exception is anything that is truly at risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.

    I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.

  • You need low-level access to the AI in question, and a lot of compute, but for most AI types you can infer whether a given data fragment was in the training set.

    It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.
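
    For models with open weights, the usual trick is a loss-based membership test: memorized text gets assigned conspicuously high probability. A rough sketch with HuggingFace transformers (the model name and file paths are placeholders, and serious attacks calibrate against reference models rather than eyeballing two numbers):

        # Memorized training data tends to get a much lower per-token
        # loss than comparable unseen text. Names below are placeholders.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "some-open-model"  # placeholder: any causal LM with open weights
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)

        def avg_nll(text):
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                return model(ids, labels=ids).loss.item()  # mean NLL per token

        suspect = open("suspected_training_fragment.c").read()  # placeholder
        control = open("freshly_written_control.c").read()      # placeholder
        print(avg_nll(suspect), avg_nll(control))

    A markedly lower loss on long, otherwise-unpredictable fragments is the signal.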

    Now, would that be enough to put the entire AI under GPL? I doubt it.

  • By reverse inference and model inversion. We can determine what content a pathway has been trained on. We can find out if it’s been trained on GPL material.

  • It's why I stopped contributing to open source work. It's pretty clear in the age of LLMs that this breach of the license under which the code is written will be allowed to continue and that open source code will be turned into commercial products.

  • > Genuine question: if I train my model with copyleft material, how do you prove I did?

    Discovery.

  • There's the other side of this issue. The current position of the U.S. Copyright Office is that AI output is not copyrightable, because the Constitution's copyright clause only protects human authors. This is consistent with the US position that databases and lists are not copyrightable.[1]

    Trump is trying to fire the head of the U.S. Copyright Office, but they work for the Library of Congress, not the executive branch, so that didn't work.[2]

    [1] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

    [2] https://apnews.com/article/trump-supreme-court-copyright-off...

  • > Should I keep open sourcing my code now that the licence doesn't matter anymore?

    your LICENSE matters in similar ways to how it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but that doesn't settle all cases at all times. Do not despair!

  • genuine question: why are you training your model on content whose requirements will explicitly be violated if you do?

The article goes deep into the two cases deemed most relevant, but really there is a wide swath of similar cases, all focused on drawing sharper borders than ever around what is essentially the question "exactly when does it become copyright violation?", with plenty of seemingly "obvious" answers that quickly conflict with each other.

I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I'd also not be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of the domain.

Not a lawyer, just excited to see the outcomes :).

  • Ideally, Congress would just settle this basket of copyright concerns, as they explicitly have the power to do—and have done so repeatedly in the specific context of computers and software.

    • I've pitched this idea before, but my pie-in-the-sky hope is to settle most of this with something like a huge rollback of copyright terms, to something like 10 or 15 years initially. You can get one doubling of that by submitting your work to an official "library of congress" data set, which will be used to produce common, clean, and open models that are available to anyone for a nominal fee and prevent any copyright claims against the output of those models. The money from the model fees is used to pay royalties to people with materials in the data set over time, with payouts based on recency and quantity of material, and an absolute cap to discourage flooding the data sets to game the payments.
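
      Purely to make the mechanics concrete, a toy version of that payout rule might look like this (every number, name, and weighting choice below is invented; it's just one way to do "recency and quantity with an absolute cap"):

          # Toy royalty pool: shares weighted by quantity and recency,
          # with an absolute per-author cap to discourage flooding.
          def payouts(pool, works, half_life=10.0, cap_fraction=0.01):
              # works: list of (author, amount_of_material, age_in_years)
              shares = {}
              for author, amount, age in works:
                  w = amount * 0.5 ** (age / half_life)  # recency decay
                  shares[author] = shares.get(author, 0.0) + w
              total = sum(shares.values())
              cap = cap_fraction * pool  # the anti-flooding cap
              return {a: min(pool * s / total, cap) for a, s in shares.items()}

          print(payouts(1_000_000, [("alice", 120, 2), ("bob", 5000, 14)]))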

      This solution to me amounts to an "everybody wins" situation, where producers of material are compensated, model trainers and companies can get clean, reliable data sets without having to waste time and energy scraping and digitizing it themselves, and model users can have access to a number of known "safe" models. At the same time, people not interested in "allowing" their works to be used to train AIs and people not interested in only using the public data sets can each choose to not participate in this system, and then individually resolve their copyright disputes as normal.


I honestly think that the most extreme take that "any output of an LLM falls under all the copyright of all its training data" is not really defensible, especially when contrasted with human learning, and would be curious to hear conflicting opinions.

My view is that copyright in general is a pretty abstract and artificial concept; thus the corresponding regulation needs to justify itself by being useful, i.e. encouraging and rewarding content creation.

/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up at a small handful of the most successful producers-- this effectively externalizes large portions of the cost of "raising" artists.

I view AI overlap under the same lens-- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible), then the law/interpretation simply has to be changed.

  • Anyone can very easily avoid training on GPL code. Yes, the model might not be as strong as one that is trained that way and released under the terms of the GPL, but to me that sounds like quite a good outcome if the best models are open source/open weight.

    It's all about whose outcomes are optimized.

    Of course, the law generally favors consideration of the outcomes for the massive corporations donating hundreds of millions of dollars to legislature campaigns.

    • Would it even actually help to go down that road though? IMO the expected outcome would simply be that AI training stalls for a bit while "unencumbered" training material is being collected/built up and you achieve basically nothing in the end, except creating a big ongoing logistical/administrative hassle to keep lawyers/bureaucrats fed.

      I think the redistribution effect (towards training material providers) from such a scenario would be marginal at best, especially long-term, and even that might be over-optimistic.

      I also dislike that stance because it seems obviously inconsistent to me-- if humans are allowed to train on copyrighted material without their output being generally affected, why not machines?

  • > I view AI overlap under the same lens-- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible), then the law/interpretation simply has to be changed

    Not sure about undesirable; I so wish we could just ban all generative AI.

    I feel profound sadness at having lost the world we had before generative AI became widespread. I really loved programming, and seeing my trade devalued by vibe coding is heartbreaking. We will see mass unemployment, deep fakes, more AI-induced psychosis, and a devaluing of human art. I hate this new world.

    It would be the morally correct thing to ban generative AI, as it only benefits corporations and doesn't improve people's lives but makes them worse.

    The training of the big LLMs has been criminal. Whether we talk about GPL-licensed code or the millions of artists who never released their work under a specific license and would never have consented to it being used for training.

    I still think states will allow it and legalize the crime because they believe that AI offer competitive advantages and they will fear "falling behind". Plus military use.

    • In my opinion programming has never been this much fun. The vast vast majority of code is repetitive stuff that now is a breeze. I can build so much stuff now, and with more beautiful code because refactoring is effortless.

      I think it's like going from pre industrial revolution manual labor, to modern tools and machines.


  • Reading your comment made me think about the other side of the equation. I think it's generally considered that AI-generated works are not themselves protected by copyright, so I wonder if code written with little to no human intervention becomes un-licensable.

    You don't have any rights to assert when you have AI write the code for you.

  • Human learning is materially different from LLM training. They're similar in that both involve providing input to a system that can, afterwards, produce output sharing certain statistical regularities with the input, including rote recital in some cases – but the similarities end there.

    • >Human learning is materially different from LLM training [...] but the similarities end there.

      Specifically, what "material differences" are there? The only arguments I've heard are around human exceptionalism (e.g. "brains are different, because... they just are, ok?"), or giving humans a pass because they're not evil corporations.


    • Why? I'm pretty sure I can learn the lyrics of a song, and probabilistically output them in response to a prompt.

      Is the existence of my brain copyright infringement?

      The main difference I see (apart from that I bullshit way less than LLMs), is that I can't learn nearly as much as an LLM and I can't talk to 100k people at once 24/7.

      I think the real answer here is that AI is a totally new kind of copying, and it's useful enough that laws are going to have to change to accommodate that. What country is going to shoot itself in the foot so much by essentially banning AI, just so it can feel smug about keeping its 20th century copyright laws?

      Maybe that will change when you can just type "generate a feature length Pixar blockbuster hit", but I don't see that happening for quite a long time.

Time limits on patents are supportive of the GPL in the limit.

Public trading of most trade secrets along with their owner corporations is also GPLish.

The article repeatedly treats license and contract as though they are the same, even though the sidebar links to a post that discusses the difference.

A lot of it boils down to whether training an LLM is a breach of copyright of the training materials, which is not specific to the GPL or open source.

  • And the current norm that the trillion dollar companies have lobbied for is that you can train on copyrighted material all you want so that's the reality we are living in. Everything ever published is all theirs.

    • >And the current norm that the trillion dollar companies have lobbied for is that you can train on copyrighted material all you want so that's the reality we are living in. Everything ever published is all theirs.

      What "lobbied"? Copyright law hasn't materially changed since AI got popular, so I'm not sure where these lobbying efforts are showing up in. If anything the companies that have lobbied hard in the past (eg. media companies) are opposed to the current status quo, which seems to favor AI companies.

    • I am really surprised that media businesses, which are extremely influential around the world, have not pushed back against this more. I wonder whether they see the cost savings they will get from the technology as a worthwhile trade-off.

      3 replies →

    • All theirs, if they properly obtained the copy.

      This is a big difference that has already bitten them.

    • In practice it wouldn't matter a whit if they lobbied for it or not.

      Lobbying is for people trying to stop them; externalities are for the little people.

  • To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.

    Once training is established as fair use, it doesn't really matter if the license is MIT, GPL, or a proprietary one.

    • > To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.

      Is this legally settled?


    • That is just the sort of point I am trying to make. That is a copyright law issue, not a contractual one. If the GPL is a contract then you are in breach of contract regardless of fair use or equivalents.

  • It's not specific to open source, but it's most clearly enforceable with open source, as there will be many contributors from many jurisdictions with the one unifying factor being that they all made their copyrighted work available under the same license terms.

    With proprietary or, more importantly, single-owner code, it's far easier for this to end up in a settlement rather than being dragged out into an actual ruling, enforcement action, and establishment of precedent.

    That's the key detail. It's not specific to the GPL or open source, but if you want to see these orgs held to account and some precedent established, focusing on GPL and FOSS licensed code is the clearest path to that.

  • A GPL license is a contract in most other countries. Just not in the US, probably.

    • That part of the article is about US cases, so it's US law that applies.

      > A GPL license is a contract in most other countries. Just not in the US, probably.

      Not just the US. It may vary with the version of the GPL too. Wikipedia claims it's a civil law vs common law country difference - not sure the citation shows that, though.

We need a new license that forbids all training. That is the only way to stop big corporations from doing this.

  • To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.

    If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.

  • Fair use doesn’t need a license, so it doesn’t matter what you put in the license.

    Generally speaking licenses give rights (they literally grant license). They can’t take rights away, only the legislature can do that.

  • So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.

  • By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.

  • Why forbid it when you could do exactly what this post suggests: go explicit and say that by including this copyrighted material in AI training you consent to release of the model. And you clarify that the terms are contractual, and that training the model on data represents implicit acceptance of the terms.

    • Taken to an extreme:

      "Why forbid selling drugs when you can just put a warning label on them? And you could clarify that an overdose is lethal."

      It doesn't solve any problems and just pushes enforcement actions into a hopelessly diffuse space. Meanwhile the cartel continues to profit and small time users are temporarily incarcerated.


  • Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.

    • My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights


    • It isn't that difficult: a license that restricts how the program is used is a non-free software license.

      "The freedom to run the program as you wish, for any purpose (freedom 0)."


  • We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.

    • That wouldn't matter too much though - how often do you worry about competitors directly stealing your code? Either it's server-side, or it's obfuscated or it's compiled. Anyway there's never that much stuff that's so special that it needs big legal stuff to prevent it from being copied, and if the LLM produces it you can just use another LLM to copy the same feature. And say it's 99% LLM and 1% human, who's going to know what the 1% is that's not safe to copy?

    • But then we would need a way to prove that some code was LLM generated, right?

      Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?


And then also to all code made from the GPL'd AI model?

  • A program's output is likely not owned by the program's authors. For example, if you create a document with Microsoft Word, you are the one who owns it, not Microsoft.

    • Unless the license says otherwise. The fact that Word doesn't (I wouldn't even be sure if that was true, honestly, especially for the online versions) doesn't mean anything.

      They could start selling a version of Word tomorrow that gives them the right to train from everything you type on your entire computer into any program. Or that requires you to relinquish your rights to your writing and to license it back from Microsoft, and to only be able to dispute this through arbitration. They could add a morals clause.


    • If I take a song and convert it from .mp3 to .ogg, the resulting file has no copyright since it's the output of a program?

If it's perfectly fine to "learn from GPL code", this should mean that it's perfectly fine to have an LLM assist in a clean-room implementation/reverse engineering.

I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

Corporations have always talked about the virality of the GPL, sometimes but not always to the point of exaggeration. You'd think that after getting the proof of concept done, the AI companies would be running away at full speed from setting a bomb like that in their goldmine.

Putting in tons of commonly read books and scientific papers is safer, they can just eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and has been openly and specifically hostile from the beginning. MIT and Apache, etc. you can just include a fistful of licenses to download, or even come up with architectures that track names to add for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all its output (unless they restricted it to LGPL.)

Wouldn't you just pull it out?

  • If you were a thoughtful, careful, law-abiding business, yes.

    I submit the evidence suggests the genAI companies have none of those attributes.

  • Not crazy - there's a rational self-interest in doing this.

    But I'm not certain that the relevant players have the same consequence-fearing mindset that you do, and to be honest they're probably right. The theft is too great to calculate the consequences, and by the time it's settled, what are you gonna do - turn off Forster's machine?

    I hope you're right in at least some cases!

    • > by the time it's settled

      Why would the GPL settle? Even more, who is authorized to settle for every author who used the GPL? If the courts decided in favor of the GPL, which I think would be likely just because of the age and pervasiveness of the GPL, they'd actually have to lobby Congress to write an exception to copyright rules for AI.

      A large part of the infrastructure of the world is built on the GPL, and the people who wrote it were clearly motivated by the protection that they thought that the GPL would give to what was often a charitable act, or even an act that would allow companies to share code without having to compete with themselves. I can't imagine too many judges just going "nope."


  • > I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

    Haha no.

    https://windsurf.com/blog/copilot-trains-on-gpl-codeium-does...

    And just in the last two days, AI generating LGPL headers (which it could not do if identifying LGPL code was pulled from the codebase) and misattributing authors:

    https://devclass.com/2025/11/27/ocaml-maintainers-reject-mas...

    • Thanks for the links.

      That first link shows people actively pulling out GPL code in 2023 and marketing around that fact, though. That's not great evidence that they're not doing it now, especially if testing for if GPL code is still in there seems to be as easy as prompting with an incomplete piece of it.

      I'd think that companies could amass a collection of all known GPL code and test for it regularly in order to refine their methods for keeping it out.

      > (which it could not do if identifying LGPL code was pulled from the codebase)

      Are you sure about this? Linking to LGPL code is fine afaik. And why not train on code that linked to universally available libraries that are legal to use? Seems like one might even prefer it.

      Seems like this was rejected for size and slop reasons, not licensing. If the submitter of the PR isn't even fixing possibly hallucinated author's names, it's obvious that they didn't really read it. Debugging vibe coded stuff is like finding an indeterminate number of needles in a haystack.


What triggers me is how insistent Claude Code is on adding "co-authored by Claude" in commits, in spite of my settings and an instruction in CLAUDE.md. I wish all these tech bros were as willing to credit the human shoulders on which their products are built. But they'd be much less successful in our current system if they were that kind of people.
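
For what it's worth, the documented knob for this (assuming I'm remembering the Claude Code settings docs right; whether it is honored is, as above, another matter) lives in ~/.claude/settings.json:

    {
      "includeCoAuthoredBy": false
    }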

As someone who has spent a fair amount of time developing open source software, I will say I genuinely dislike copyleft and GPL.

For those who are into freedom, I don't see how dictating how you use what you build in such a manner is in the spirit of free and open.

Just my opinion on it, to each their own on the matter.

  • I had a very similar view once, and have since understood that this is mainly a difference in perspective:

    It's easy as a developer to slip into a role where you want to build/package (maybe sell) some software product with minimal obligations. BSD-likes are obviously great there.

    But the GPL follows a different perspective: it tries to make sure that every user of any software product is always capable of tinkering with and changing it himself, and the more permissive licenses do not help there, because they don't prevent (or even discourage!) companies from just selling you stripped and obfuscated binary blobs that put you fully at the vendor's mercy.

    • I understand people want to control what happens once they build something. Too often you see startups go with a permissive model only to move to a more restrictive one once something like that happens. Then it ends up upsetting a lot of people.

      I'm of the opinion that what I build, I'm willing to share it and let others use it as they see fit even if it's not to my advantage.


  • Copyleft isn't about the software author's freedom, it's about the end-user's freedom. Copyleft grants the end-user the freedom to study and modify the code, i.e. the right to repair. Contrast this with closed-source software, which may incorporate permissively licensed code: the end-user has no right to study, no right to modify, and no right to repair. Ergo less freedom.

    • I think it makes a lot of sense for hobby software and non-commercial software. It's just tough to do in a commercial setting for a number of reasons.

      So ultimately while good intentioned, you end up limiting how many people can use what you've built.

  • It's not dictating how you use what you build? It's dictating how you redistribute what you build on top of other people's work.

  • https://gavinhoward.com/2023/12/is-source-available-really-t...

    • just a comment on this article, which may be unrelated to the point you want to make: Gavin makes a fatal mistake in interpreting RMS's intent. He claims that RMS only wanted control over his hardware. That is not true. He also wanted the right to share his code with others. The person who had the code for his printer was not allowed to share that code. RMS wanted to ensure that the person who has the code is also allowed to share it. Source-available does not do that.

  • I disagree as someone who has also spent a huge amount of time on open source software. It’s all GPL or AGPL :)

    • That's your prerogative. It's just not for me and GPL is basically something I avoid when possible.

  • > As someone who has spent a fair amount of time developing open source software, I will say I genuinely dislike copyleft and GPL.

    GPL: Help the user

    MIT: Help some random company screw the users and save money not hiring people.

    Then again I see you're a founder at some AI company so I strongly doubt your motives and statement.

    • I've spent years building something for free with no expectation of anything in return. Perhaps someone just doesn't believe in the GPL and has no ulterior motive for that.


  • As somebody who thinks that people currently own the code that they write, I wonder why you're minding the business of people who want to write GPL'd software.

    Are you complaining about proprietary software? I hear the restrictions are a lot tighter for Photoshop's source code, or iOS's, but for some reason you are one of the people who hate GPL as a hobby. Please don't show up whining about "spirits" when Amazon puts you out of business.

    • I opened the profile of the user and he's a founder of an AI company. I guess that explains it.

    • I'm not in anyone's business just sharing my opinion on GPL. I understand why people go GPL / AGPL just not for me. To each their own if they want to go down that path.

GPL and copyright in general don't apply to billionaires, so pretty much a non-topic.

It's just a side cost of doing business, because asking for forgiveness is cheaper and faster than asking for permission.

  • "Information wants to be free"? Many individuals pirated movies and games and got away with it. Of course two wrongs don't make a right and all that. Nonetheless one should be compensated for creating material that ai trained on for the same reasons copyright is compensated - to incentives people to produce it.

I thought the whole concept of a viral license was legally questionable to begin with. There haven't been cases about this, as far as I know, and GPL virality enforcement has just been done by the community.

  • The GPL was tested in court as early as 2006 [1] and plenty of times since. There are no serious doubts about its enforceability.

    [1] https://www.fsf.org/news/wallace-vs-fsf

    • I know it's not popular on HN to have anything but supportive statements around GPL, and I'm a big GPL supporter myself, but there is nuance in what is being said here.

      That case was important, but it's not about the virality. There have been no concluded court cases involving the virality portion causing the rest of the code to also be GPL'd, but there are plenty involving enforcement of the GPL on the GPL code itself.

      The distinction is important because the article is about the virality causing the whole LLM model to be GPL'd, not just about the GPL'd code itself.

      I'd like to think it wouldn't be a problem to enforce, but I've also never seen a court ruling truly about the virality portion to back that up either - which is all GP is saying.


  • There have been a number of cases, which are linked from Wikipedia (https://en.wikipedia.org/wiki/GNU_General_Public_License#Leg...) - most recently Entr’Ouvert v. Orange had a strong judgement (under French law) in favour of the GPL.

    Conversely, to my knowledge there has been no court decision that indicates that the GPL is _not_ enforceable. I think you might want to be more familiar with the area before you decide if it's legally questionable or not.

    • I'm not suggesting that you avoid following it. I'm just not that convinced it's enforceable in the US. The French ruling is good, though.

  • If you don't like the license, then don't accept it.

    You are then restricted by copyright just like with any other creation.

    If I include the source code of Windows into my product, I can't simply choose to re-license it to say public domain and give it to someone else, the license that I have from Microsoft to allow me to use their code won't let me - it provides restrictions. It's just as "viral" as the GPL.

    • I like the GPL. I just don't know how much you can actually enforce it.

      Also, "don't use my code" is not viral. If you break the MSFT license, you pay them, which is a very well-tested path in courts. The idea of forced public disclosure does not seem to be.


Training is not redistribution. It's the exact same as you as a person learning to program from proprietary secret code, and then writing your own original code independently. Even if you repeat patterns and methods you've picked up from that proprietary learning material, it is by no means redistribution. The practical differentiator here is that you do not access the proprietary material during the creation of your own original work, similar in principle to a clean-room design. With AI/ML, it matters that training data is not accessed during inference, which it's not.

The other factor of copyright, which is relevant, is how material is obtained. If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use. If you don't want AI training to be done on your work, you need to put access to it behind explicit authentication with a legally-binding user agreement prohibiting that use-case. Do note that this would lose your project's status as open-source.

  • > Training is not redistribution. It's the exact same as you as a person learning to program from proprietary secret code, and then writing your own original code independently.

    Well, the difference is that copyright law applies to work fixed in a tangible medium of expression. This covers, e.g., model weights on a hard drive, but not the human brain. If the model is able to reproduce others’ work verbatim (like the example the article brings up of the song lyrics), then under copyright law that’s unauthorized reproduction. It doesn’t matter that the data is expressed via probabilistic weights, because due to past lobbying/lawsuits by the software industry to get compiled binary code covered by copyright, reproduction can include copies that aren’t directly human readable.

    > If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use.

    There’s over 20 years of successful GPL infringement lawsuits over unlicensed use of publicly available GPL code that disagrees with this point.

  • so basically we download the source files into the training weights and remove the LICENSE.md, as it's exactly the same as a person learning to program from proprietary secret code and outputting code based on that for millions of people in a matter of seconds /s

    we also treat public goods found on the internet however we want, as if the World Intellectual Property Organization Copyright Treaty and the Berne Convention for the Protection of Literary and Artistic Works aren't real, or because we can, as we are operating in international waters, selling products to other sailors living exclusively in international waters /s

    • If you download GPL source code and run `wc` on its files and distribute the output of that, is that a violation of copyright and the GPL? What if you do that for every GPL program on github? What if you use python and numpy and generate a list of every word or symbol used in those programs and how frequently they appear? What if you generate the same frequency data, but also add a weighting by what the previous symbol or word was? What if you did that and also added a weighting by what the next symbol or word was? How many statistical analyses of the code files do you need to bundle together before it becomes copyright infringement?
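
      To make the escalation concrete, the middle steps look roughly like this (a sketch; the corpus path is invented):

          # From plain counts to context-weighted counts: each step keeps
          # strictly more of the original structure. Corpus path invented.
          from collections import Counter
          from pathlib import Path

          unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
          for path in Path("gpl_corpus").rglob("*.c"):
              toks = path.read_text(errors="ignore").split()
              unigrams.update(toks)                           # plain frequency
              bigrams.update(zip(toks, toks[1:]))             # weighted by previous token
              trigrams.update(zip(toks, toks[1:], toks[2:]))  # plus the next one

      Somewhere on the slope from `wc` to n-grams to a trillion-parameter model, the aggregate stops being statistics about the code and becomes able to emit it verbatim, and it is hard to point to the exact step where that happened.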
