← Back to context

Comment by tpmoney

1 day ago

If you download GPL source code and run `wc` on its files and distribute the output, is that a violation of copyright and the GPL? What if you do that for every GPL program on GitHub? What if you use Python and NumPy to generate a list of every word or symbol used in those programs and how frequently each appears? What if you generate the same frequency data, but also add a weighting by what the previous symbol or word was? What if you did that and also added a weighting by what the next symbol or word was? How many statistical analyses of the code files do you need to bundle together before it becomes copyright infringement?
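The escalating analyses in that list can be sketched in a few lines of Python. This is a minimal illustration, not anyone's actual pipeline; the tiny `corpus` string and the helper names are invented for the example. It shows the three rungs named above: `wc`-style counts, plain symbol frequencies, and frequencies weighted by the previous symbol (i.e. a bigram model).

```python
# Sketch of the escalating analyses: plain counts, then symbol
# frequencies, then frequencies conditioned on the previous symbol
# (a bigram model). Corpus and names are illustrative only.
from collections import Counter, defaultdict

def line_word_char_counts(text: str):
    """Roughly what `wc` reports: lines, words, characters."""
    return len(text.splitlines()), len(text.split()), len(text)

def symbol_frequencies(text: str) -> Counter:
    """How often each word/symbol appears in the corpus."""
    return Counter(text.split())

def bigram_frequencies(text: str) -> dict:
    """Symbol frequencies weighted by the previous symbol."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

corpus = "int main ( void ) { return 0 ; }"
print(line_word_char_counts(corpus))
print(symbol_frequencies(corpus).most_common(3))
print(dict(bigram_frequencies(corpus)["return"]))
```

Each step keeps strictly more information about the original text than the last, which is the point of the question: nothing in the progression marks an obvious place to stop.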

The line is somewhere between running `wc` on the entire input and running `gzip` on the entire input.

The fact that a slippery slope is slippery doesn't make it not a slope.

  • Of course there is a line. And everything we know about how AI models work points to them being on the ‘wc’ side of the line.

    • Not the way I see it.

      The argument that GPL code is a tiny minority of what's in the model makes no sense to me. (To be clear, you're not making this argument.) One book is a tiny minority of an entire library, but that doesn't mean it's fine to copy that book word for word simply because you can point to a Large Library Model that contains it.

      LLMs definitely store pretty high-fidelity representations of specific facts and procedures, so for me it makes more sense to start from the gzip end of the slope and slide the other way. If you took some GPL code and renamed all the variables, is that suddenly ok? What if you mapped the code to an AST and then stored a representation of that AST? What if it was a "fuzzy" or "probabilistic" AST that enabled the regeneration of a functionally equivalent program but the specific control flow and variable names and comments are different? It would be the analogue of (lossy) perceptual coding for audio compression, only instead of "perceptual" it's "functional".

      This is starting to look more and more like what LLMs store, though they're actually dumber and closer to the literal text than something that maintains function.

      It also feels a lot closer to 'gzip' than 'wc', imho.

      3 replies →
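The "rename all the variables" step from the reply above can be made concrete with Python's own `ast` module. This is an illustrative sketch, not a real obfuscator: the source snippet and the `v0, v1, ...` naming scheme are invented, built-in names are skipped with a crude check, and a real tool would also have to handle attributes, imports, and scoping.

```python
# Rename every variable in a snippet via its AST. The identifiers
# change, but the program's structure -- arguably the expressive
# part -- survives intact in the tree.
import ast
import builtins

class Renamer(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        # Crude guard: leave built-ins like `print` alone.
        if hasattr(builtins, node.id):
            return node
        if node.id not in self.mapping:
            self.mapping[node.id] = f"v{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

source = "total = price * quantity\nprint(total)"
tree = Renamer().visit(ast.parse(source))
print(ast.unparse(tree))
```

Running this prints a program with the same shape and behavior under different names, which is exactly why renaming alone is not usually considered to escape copyright: the expression being protected lives in the structure, not the labels.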