
Comment by snowhale

17 hours ago

the spec-first approach is actually the historical clean-room technique, same way Phoenix BIOS was legally written without copyright exposure in the 80s -- one team writes spec from observation, completely separate team codes from spec only, no shared authors. here it's AI doing both passes but in different sessions with no shared context, which approximates the same separation. probably good enough legally but definitely interesting that the same old trick applies.
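A minimal sketch of that two-pass separation, assuming a hypothetical `call_llm` helper in place of any real model API (the stub here just echoes, so the structure is runnable): the spec session sees the original source, and the implementation session gets a fresh context containing only the spec.

```python
def call_llm(messages):
    # Placeholder for a real chat-completion call; echoes a canned
    # string so the orchestration structure can run standalone.
    return "(model output for %d message(s))" % len(messages)

ORIGINAL_SOURCE = "/* GPL driver source would go here */"

# Pass 1: spec-writing session -- observes the original source and is
# asked to emit behavior only, no code.
spec_session = [
    {"role": "user",
     "content": "Describe the observable behavior of this driver as a "
                "functional spec. Do not include any code.\n"
                + ORIGINAL_SOURCE},
]
spec = call_llm(spec_session)

# Pass 2: implementation session -- a brand-new context that sees ONLY
# the spec, never the original source.
impl_session = [
    {"role": "user",
     "content": "Implement a driver for a new platform from this spec:\n"
                + spec},
]
implementation = call_llm(impl_session)

# The isolation property the argument relies on: no text from the
# original source appears in the coding pass's context window.
assert ORIGINAL_SOURCE not in impl_session[0]["content"]
```

The assert captures only context-window isolation; as the replies below note, it says nothing about what the model's weights retain from training.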

I wouldn't call this "clean-room". The models were trained on all available open source, including that exact original Linux driver. Splitting sessions saves you from direct copy-paste in the current context window, but the weights themselves remember the internal code structure perfectly well. Lawyers still have to rack their brains over this, but for now, it looks more like license laundering through the neural net's latent space than true reverse engineering

You haven't addressed the parent's concern at all, which is about what the LLM was trained on, not what was fed into its context window. The Linux driver is almost certainly in the LLM's training data.

Also, the "spec" that the LLM wrote to simulate the "clean-room" technique is full of C code from the Linux driver.

  • This is speculation, but I suspect the training data argument is going to be a real loser in the courtroom. We’re getting out of the region where memorization is a big failure mode for frontier models. They are also increasingly trained on synthetic text, whose copyright is very difficult to determine.

    We also have yet to see anyone successfully sue over software copyright with LLMs -- and, this is a bit redundant, but we’ve also not seen a user of one of these models be sued for output.

    Maybe we converge on the view of the US copyright office which is that none of this can be protected.

    I kind of like that one as a future for software engineers, because it forces them all at long last to become rules lawyers. If we disallow all copyright protection for machine-generated code, there might be a cottage industry of folks who provide a reliably human layer that is copyrightable. Like Boeing, they will have to write to the regulator and not to the spec. I feel that’s a suitable destination for a discipline that’s had it too good for too long.

  • Okay, so will companies now vibe-code a Linux-like license-washed kernel, to get rid of the GPL?

    > The Linux driver is almost certainly in the LLM's training data.

    Yes, and? Isn't Stallman's first freedom the "freedom to study the source code" (FSF Freedom 1)? Where does it say I have to be a human to study it? If you argue "oh, but you may only read / train on the source code if you intend to write / generate GPL code", then you're admitting that the GPL is effectively only meant for "libre" programmers in their "libre" universe and it might as well be closed-source. If a human may study the code to extract the logic (the "idea") without infringing on the expression, why is it called "laundering" when a machine does it?

    Let's say I look (as a human) at some GPL source code. And then I close the browser tab and roughly re-implement from memory what I saw. Am I now required to release my own code as GPL? More extreme: If I read some GPL code and a year later I implement a program that roughly resembles what I saw back then, then I can, in your universe, be sued because only "libre programmers" may read "libre source code".

    In German copyright law, there is a concept of a "fading formula": if the creative features of the original work "fade away" behind the independent content of the new work to the point of being unrecognizable, it constitutes a new work, not a derivative, so the input license doesn't matter. So, for LLMs, even if the input is GPL, proprietary, whatever: if the output is unrecognizable from the input, it does not matter.

    • > Let's say I look (as a human) at some GPL source code. And then I close the browser tab and roughly re-implement from memory what I saw. Am I now required to release my own code as GPL? More extreme: If I read some GPL code and a year later I implement a program that roughly resembles what I saw back then, then I can, in your universe, be sued because only "libre programmers" may read "libre source code".

      It's entirely dependent on how similar the code you write is to the licensed code that you saw, and what could be proved about what you saw, but potentially yes: if you study GPL code, and then write code that is very uniquely similar to it, you may have infringed on the author's copyright. US courts have made some rulings which say that the substantial similarity standard does apply to software, although pretty much every ruling for these cases ends up in the defendant's favor (the one who allegedly "copied" some software).

      > So, for LLMs, even if the input is GPL, proprietary, whatever: if the output is unrecognizable from the input, it does not matter.

      Sure, but that doesn't apply to this instance. This is implementing a BSD driver based on a Linux driver for that hardware. I'm not making the general case that LLMs are committing copyright infringement on a grand scale. I'm saying that giving GPL code to an LLM (in this case the GPL code was input to the model, which seems much more egregious than it being in the training data) and having the LLM generate that code ported to a new platform feels slimy. If we can do this, then copyleft licenses will become pretty much meaningless. I gather some people would consider that a win.

  • fair point, I glossed over that distinction. context separation != training data separation. if the driver was in training data, the "spec from observation" pass is already contaminated before the coding pass begins. the phoenix bios parallel actually required strict information separation at every stage -- here that's not achievable since you can't retrain the model. so the legal protection is much weaker than I implied.