← Back to context

Comment by galaxyLogic

7 hours ago

But then if AI output is not under GNU General Public License, how can it become so just because a Linux-developer adds it to the code-base?

AIs are not human and therefore their output is a human authored contribution and only human authored things are covered by copyright. The work might hypothetically infringe on other people's copyright. But such an infringement does not happen until a human decides to create and distribute a work that somehow integrates that generated code or text.

The solution documented here seems very pragmatic. You as a contributor simply state that you are making the contribution and that you are not infringing on other people's work with that contribution under the GPLv2. And you document the fact that you used AI for transparency reasons.

There is a lot of legal murkiness around how training data is handled, and the output of the models. Or even the models themselves. Is something that in no way or shape resembles a copyrighted work (i.e. a model) actually distributing that work? The legal arguments here will probably take a long time to settle but it seems the fair use concept offers a way out here. You might create potentially infringing work with a model that may or may not be covered by fair use. But that would be your decision.

For small contributions to the Linux kernel it would be hard to argue that a passing resemblance of say a for loop in the contribution to some for loop in somebody else's code base would be anything else than coincidence or fair use.

  • That you can't copyright the AI's output (in the US, at least), doesn't imply it doesn't contain copyrighted material. If you generate an image of a Disney character, Disney still owns the copyright to that character.

  • IANAL; this is what my limited understanding of the matter is. With that caveat: it is easy to forget that copyright is on output- verbatim or exact reproductions and derivatives of a covered work are already covered under copyright.

    So if the AI outputs Starry Night or Starry Night in different color theme, that's likely infringement without permission from van Gogh, who would have recourse against someone, either the user or the AI provider.

    But a starry-night style picture of an aquarium might not be infringing at all.

    >For small contributions to the Linux kernel it would be hard to argue that a passing resemblance of say a for loop in the contribution to some for loop in somebody else's code base would be anything else than coincidence or fair use.

    I would argue that if it was a verbatim reproduction of a copyrighted piece of software, that would likely be infringing. But if it was similar only in style, with different function names and structure, probably not infringing.

    Folks will argue that some things might be too small to do any different, for example a tiny snippet like python print("hello") or 1+1=2 or a for loop in your example. In that case it's too lacking in original expression to qualify for copyright protection anyway.

  • >AIs are not human and therefore their output is a human authored contribution and only human authored things are covered by copyright.

    That is a non sequitur. Also, I'm not sure if copyright applies to humans, or persons (not that I have encountered particularly creative corporations, but Taranaki Maunga has been known for large scale decorative works)

  • Didn't a court in the US declare that AI generated content cannot be copyrighted? I think that could be a problem for AI generated code. Fine for projects with an MIT/BSD license I suppose, but GPL relies on copyright.

    However, if the code has been slightly changed by a human, it can be copyrighted again. I think.

    • Thaler v. Perlmutter said that an AI system cannot be listed as the sole author of a work - copyright requires a human author.

      US Copyright Office guidance in 2023 said work created with the help of AI can be registered as long as there is "sufficient human creative input". I don't believe that has ever been qualified with respect to code, but my instinct is that the way most people use coding agents (especially for something like kernel development) would qualify.

      1 reply →

    • No, a court did not declare that. The case involved a person trying to register a work with only the AI system listed as author. The Supreme Court decided that you can't do that, you need to list a human being as author to register a work with the Copyright Office. This stems from existing precedent where someone tried to register a photograph with the monkey photographer listed as author.

      I don't believe the idea that humans can or can't claim copyright over AI-authored works has been tested. The Copyright Office says your prompt doesn't count and you need some human-authored element in the final work. We'll have to see.

      2 replies →

    • > Didn't a court in the US declare that AI generated content cannot be copyrighted?

      No, my understanding is that AI generated content can't be copyrighted by the AI. A human can still copyright it, however.

Same as if a regular person did the same. They are responsible for it. If you're using AI, check the code doesn't violate licenses

  • In certain law cases plagiarization can be influenced by the fact if person is exposed to the copyrighted work. AI models are exposed to very large corpus of works..

    • Copyright infringement and plagiarism are not the same or even very closely related. They're different concepts and not interchangeable. Relative to copyright infringement, cases of plagiarism are rarely a matter for courts to decide or care about at all. Plagiarism is primarily an ethical (and not civil or criminal) matter. Rather than be dealt with by the legal system, it is the subject of codes of ethics within e.g. academia, journalism, etc. which have their own extra-judicial standards and methods of enforcement.

      1 reply →

  • As opposed to an irregular person?

    LLMs are not persons, not even legal ones (which itself is a massive hack causing massive issues such as using corporate finances for political gain).

    A human has moral value a text model does not. A human has limitations in both time and memory available, a model of text does not. I don't see why comparisons to humans have any relevance. Just because a human can do something does not mean machines run by corporations should be able to do it en-masse.

    The rules of copyright allow humans to do certain things because:

    - Learning enriches the human.

    - Once a human consumes information, he can't willingly forget it.

    - It is impossible to prove how much a human-created intellectual work is based on others.

    With LLMs:

    - Training (let's not anthropomorphize: lossily-compressing input data by detecting and extracting patterns) enriches only the corporation which owns it.

    - It's perfectly possible to create a model based only on content with specific licenses or only public domain.

    - It's possible to trace every single output byte to quantifiable influences from every single input byte. It's just not an interesting line of inquiry for the corporations benefiting from the legal gray area.

Tab complete does not produce copyrightable material either. Yet we don't require software to be written in nano.

If the output is public domain it's fine as I understand it.

  • Makes sense to me. But so anybody can take Public Domain code and place it under GNU Public License (by dropping it into a Linux source-code file) ?

    Surely the person doing so would be responsible for doing so, but are they doing anything wrong?

    • > Surely the person doing so would be responsible for doing so, but are they doing anything wrong?

      You're perfectly at liberty to relicense public domain code if you wish.

      The only thing you can't do is enforce the new license against people who obtain the code independently - either from the same source you did, or from a different source that doesn't carry your license.

      2 replies →

    • Linux code doesn't have to strictly be GPL-only, it just has to be GPL-compatible.

      If your license allows others to take the code and redistribute it with extra conditions, your code can be imported into the kernel. AFAIK there are parts of the kernel that are BSD-licensed.

    • The core thing about licenses, in general, is that they only grant new usage. If you can already use the code because it's public domain, they don't further restrict it. The license, in that case, is irrelevant.

      Remember that licenses are powered by copyright - granting a license to non-copyrighted code doesn't do anything, because there's no enforcement mechanism.

      This is also why copyright reform for software engineering is so important, because code entering the public domain cuts the gordian knot of licensing issues.

    • Sqlite’s source code is public domain. Surely if you dropped the sqlite source code into Linux, it wouldn’t suddenly become GPL code? I’m not sure how it works

  • This ruling is IMO/IANAL based on lawyers and judges not understanding how LLMs work internally, falling for the marketing campaign calling them "AI" and not understanding the full implications.

    LLM-creation ("training") involves detecting/compressing patterns of the input. Inference generates statistically probable based on similarities of patterns to those found in the "training" input. Computers don't learn or have ideas, they always operate on representations, it's nothing more than any other mechanical transformation. It should not erase copyright any more than synonym substitution.

    • >LLM-creation ("training") involves detecting/compressing patterns of the input.

      There's a pretty compelling argument that this is essentially what we do, and that what we think of as creativity is just copying, transforming, and combining ideas.

      LLMs are interesting because that compression forces distilling the world down into its constituent parts and learning about the relationships between ideas. While it's absolutely possible (or even likely for certain prompts) that models can regurgitate text very similar to their inputs, that is not usually what seems to be happening.

      They actually appear to be little remix engines that can fit the pieces together to solve the thing you're asking for, and we do have some evidence that the models are able to accomplish things that are not represented in their training sets.

      Kirby Ferguson's video on this is pretty great: https://www.youtube.com/watch?v=X9RYuvPCQUA

      1 reply →

    • fortunately, you aren't only operating on representations, right? lemme check my Schopenhauer right quick...