Comment by tpmoney
19 hours ago
Of course there is a line. And everything we know about how AI models work points to them being on the ‘wc’ side of the line
> Of course there is a line. And everything we know about how AI models work points to them being on the ‘wc’ side of the line
Not the way I see it.
The argument that GPL code is a tiny minority of what's in the model makes no sense to me. (To be clear, you're not making this argument.) One book is a tiny minority of an entire library, but that doesn't mean it's fine to copy that book word for word simply because you can point to a Large Library Model that contains it.
LLMs definitely store pretty high-fidelity representations of specific facts and procedures, so for me it makes more sense to start from the gzip end of the slope and slide the other way. If you took some GPL code and renamed all the variables, is that suddenly ok? What if you mapped the code to an AST and then stored a representation of that AST? What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program, but the specific control flow and variable names and comments are different? It would be the analogue of (lossy) perceptual coding for audio compression, only instead of "perceptual" it's "functional". (There's a toy sketch of what I mean at the end of this comment.)
This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.
It also feels a lot closer to 'gzip' than 'wc', imho.
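To make the AST round-trip idea concrete, here's a toy sketch (my own illustration, not anything from the thread; the function and variable names are made up) using Python's standard `ast` module: parse a snippet, map every non-builtin identifier to an anonymous one, and unparse it. The result behaves identically but shares almost none of the original text.

```python
import ast
import builtins

SOURCE = """
def total(prices, tax_rate):
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)
"""

class Renamer(ast.NodeTransformer):
    """Map every non-builtin identifier to an anonymous name (v0, v1, ...)."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if hasattr(builtins, name):  # leave builtins like sum() alone
            return name
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._fresh(node.name)
        self.generic_visit(node)  # rename args and body identifiers too
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node


tree = Renamer().visit(ast.parse(SOURCE))
print(ast.unparse(tree))  # same behaviour, almost none of the original text
```

This prints something like `def v0(v1, v2): ...` — functionally identical, textually unrecognizable. A "fuzzy" AST representation would be one more step along the same slope.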
> LLMs definitely store pretty high-fidelity representations of specific facts and procedures
Specific facts and procedures are explicitly NOT protected by copyright. That's what made cloning the IBM BIOS legal. It's what makes emulators legal. It's what makes the retro-clone RPG industry legal. It's what made Google cloning the Java API legal.
> If you took some GPL code and renamed all the variables, is that suddenly ok?
Generally no, not sufficiently transformative.
> What if you mapped the code to an AST and then stored a representation of that AST?
Generally no; distributing software in a compiled or otherwise transformed binary form is still considered a copyright violation, and an AST dump is just another encoding of the same program.
> What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different?
This starts to get a lot fuzzier. De-compilation is legal. Creating programs that are functionally identical to other programs is (generally) legal. Creating an emulator for a system is legal. Copyright protects a specific fixed expression of a creative idea, not the idea itself. We don't want to live in the world where Wine is a copyright violation.
> This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.
And yet, so far no one has brought a legal case against the AI companies for being able to extract their copyright-protected material from the models. The few early examples of that happening are things that model makers explicitly attempt to train out of their models; it's unwanted behavior that is treated as a bug, not a feature. Further, the fact that a machine is able to violate copyright does not in and of itself make the machine a violation of copyright. See also Xerox machines, DeCSS, Handbrake, Plex/Jellyfin, CD-Rs, DVRs, VHS recorders, etc.
> Specific facts and procedures are explicitly NOT protected by copyright.
No argument there, and I'm grateful for the limits of copyright. That part was only for describing what LLM weights store -- just because the literal text is not explicitly encoded doesn't mean that facts and procedures aren't.
> Copyright protects a specific fixed expression of a creative idea, not the idea itself.
Right. Which is why it's weird to talk about the weights being derivative works. Weird, but perhaps not wrong: if you look at the most clear-cut situation, where the LLM is able to reproduce a big chunk of its input bit-for-bit, then the fact that its basis of representation is completely different doesn't feel like it matters much. An image that is lossily compressed, converted to a bitstream, and encoded in DNA is very different from the input, but if an image can be recovered that is indistinguishable or barely distinguishable from the original, I'd still call that copying, and each intermediate step a significant but irrelevant transformation.
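A lossless toy version of that pipeline, just to make the "different representation, same recoverable content" point concrete (my own sketch; the snippet being "copied" and the two-bits-per-base DNA encoding are made up for illustration):

```python
import zlib

BASES = "ACGT"

def to_dna(data: bytes) -> str:
    # two bits per base, four bases per byte, most significant bits first
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

def from_dna(dna: str) -> bytes:
    out = bytearray()
    for i in range(0, len(dna), 4):
        b = 0
        for base in dna[i:i + 4]:
            b = (b << 2) | BASES.index(base)
        out.append(b)
    return bytes(out)

original = b"if (n < 0) { return -n; } /* some 'copyrighted' code */"
dna = to_dna(zlib.compress(original))        # looks nothing like the input
restored = zlib.decompress(from_dna(dna))    # reverse the whole chain
assert restored == original
print(dna[:40], "...")
```

Every intermediate form here is unrecognizable as the original, yet the original comes back exactly; the interesting question is how much that conclusion changes once the pipeline is lossy and only "functionally equivalent" output comes back.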
> This starts to get a lot fuzzier. De-compilation is legal.
I'm less interested in what the legal system is currently capable of concluding. I personally don't think the laws have caught up to the present reality, so present-day legality isn't the crucial determinant in figuring out how things "ought" to work.
If an LLM is completely incapable of reproducing input text verbatim, yet could be made capable of it through targeted ablation (ablation that does not itself incorporate the text in question!), then does it store that text or not?
I'm not sure why I'm even debating this, other than for intellectual curiosity; my opinion isn't actually relevant to anyone. Namely: I think the general shape of how this ought to work is pretty straightforward and obvious, but (1) it does not match current legal reality, and more importantly, (2) it is highly inconvenient for many stakeholders (very much including LLM users). Not to mention that (3) although that shape is pretty clear in my head, it involves many, many judgement calls, such as the ones we've been discussing here, and the general shape of how it ought to work isn't going to help make those calls.