Comment by sfink

1 day ago

The line is somewhere between running wc on the entire input and running gzip on the entire input.
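
Concretely, a toy sketch of the two endpoints (Python; the file name program.c is just a hypothetical stand-in for "the entire input"):

    import gzip

    data = open("program.c", "rb").read()

    # The wc end of the line: keep only aggregate counts. Nothing of the
    # original text can be reconstructed from these three numbers.
    lines, words, chars = data.count(b"\n"), len(data.split()), len(data)

    # The gzip end of the line: a smaller but fully invertible encoding.
    # Every byte of the original is recoverable.
    packed = gzip.compress(data)
    assert gzip.decompress(packed) == data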

The fact that a slippery slope is slippery doesn't make it not a slope.

Of course there is a line. And everything we know about how AI models work points to them being on the 'wc' side of it.

  • Not the way I see it.

    The argument that GPL code is a tiny minority of what's in the model makes no sense to me. (To be clear, you're not making this argument.) One book is a tiny minority of an entire library, but that doesn't mean it's fine to copy that book word for word simply because you can point to a Large Library Model that contains it.

    LLMs definitely store pretty high-fidelity representations of specific facts and procedures, so for me it makes more sense to start from the gzip end of the slope and slide the other way. If you took some GPL code and renamed all the variables, is that suddenly ok? What if you mapped the code to an AST and then stored a representation of that AST? What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different? It would be the analogue of (lossy) perceptual coding for audio compression, only instead of "perceptual" it's "functional".
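
    To make the thought experiment concrete, here is a toy sketch of the rename-via-AST step (Python's ast module standing in, loosely, for whatever representation a model might hold): parse to an AST, discard the original identifiers, and regenerate source that behaves identically.

        import ast, builtins

        class Renamer(ast.NodeTransformer):
            """Replace every non-builtin name with an opaque token."""
            def __init__(self):
                self.mapping = {}

            def visit_Name(self, node):
                if hasattr(builtins, node.id):
                    return node  # keep print, len, ... so behavior is unchanged
                node.id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
                return node

        src = "total = price * quantity\nprint(total)"
        print(ast.unparse(Renamer().visit(ast.parse(src))))
        # -> v0 = v1 * v2
        #    print(v0)

    The output is byte-for-byte different from the input, yet it plainly encodes the same expression.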

    This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.

    It also feels a lot closer to 'gzip' than 'wc', imho.

    • > LLMs definitely store pretty high-fidelity representations of specific facts and procedures

      Specific facts and procedures are explicitly NOT protected by copyright. That's what made clean-room cloning of the IBM BIOS legal. It's what makes emulators legal. It's what makes the retro-clone RPG industry legal. It's a large part of why Google's reimplementation of the Java API was ultimately ruled lawful (the Supreme Court decided that case on fair-use grounds).

      > If you took some GPL code and renamed all the variables, is that suddenly ok?

      Generally no, not sufficiently transformative.

      > What if you mapped the code to an AST and then stored a representation of that AST?

      Generally no; an AST is just another encoding of the same program, much as a compiled binary is, and distributing software in binary form is still considered a copyright violation.

      > What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different?

      This starts to get a lot fuzzier. Decompilation is (generally) legal. Creating programs that are functionally identical to other programs is (generally) legal. Creating an emulator for a system is legal. Copyright protects a specific fixed expression of a creative idea, not the idea itself. We don't want to live in a world where Wine is a copyright violation.

      > This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.

      And yet, so far no legal case against the AI companies has succeeded on the theory that their models' ability to reproduce copyright-protected material makes the models themselves infringing. The few early examples of such extraction are behaviors that model makers explicitly attempt to train out of their models; it's unwanted behavior that is treated as a bug, not a feature. Further, the fact that a machine can be used to violate copyright does not, in and of itself, make the machine a copyright violation. See also Xerox machines, DeCSS, Handbrake, Plex/Jellyfin, CD-Rs, DVRs, VHS recorders, etc.