Comment by sfink

21 hours ago

> Specific facts and procedures are explicitly NOT protected by copyright.

No argument there, and I'm grateful for the limits of copyright. That part was only for describing what LLM weights store -- just because the literal text is not explicitly encoded doesn't mean that facts and procedures aren't.

> Copyright protects a specific fixed expression of a creative idea, not the idea itself.

Right. Which is why it's weird to talk about the weights being derivative works. Weird but perhaps not wrong: if you look at the most clear-cut situation where the LLM is able to reproduce a big chunk of input bit-for-bit, then the fact that its basis of representation is completely different doesn't feel like it matters much. An image that is lossily compressed, converted to a bitstream, and encoded in DNA is very very different than the input, but if an image can be recovered that is indistinguishable or barely distinguishable from the original, I'd still call that copying and each intermediate step a significant but irrelevant transformation.
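
To make that pipeline concrete, here's a minimal Python sketch of the same round trip (purely illustrative; it assumes Pillow is installed, and every name in it is made up for this comment): the image detours through lossy JPEG compression and a toy 2-bits-per-base "DNA" encoding, and what comes back out is barely distinguishable from what went in.

```python
# Illustrative sketch only: lossy compression + a "DNA" detour, then recovery.
from io import BytesIO

from PIL import Image  # assumes Pillow is available

BASES = "ACGT"  # 2 bits per base

def bytes_to_dna(data: bytes) -> str:
    # Encode each byte as four bases, most significant bits first.
    return "".join(BASES[(b >> shift) & 0b11] for b in data for shift in (6, 4, 2, 0))

def dna_to_bytes(seq: str) -> bytes:
    # Reverse the encoding: fold each group of four bases back into one byte.
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4]:
            b = (b << 2) | BASES.index(base)
        out.append(b)
    return bytes(out)

def round_trip(path: str) -> Image.Image:
    # Lossy JPEG compression, a radically different intermediate representation,
    # and then a recovered image that is barely distinguishable from the input.
    buf = BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=75)
    dna = bytes_to_dna(buf.getvalue())
    return Image.open(BytesIO(dna_to_bytes(dna)))
```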

> This starts to get a lot fuzzier. De-compilation is legal.

I'm less interested in what the legal system is currently capable of concluding. I personally don't think the laws have caught up to the present reality, so present-day legality isn't the crucial determinant in figuring out how things "ought" to work.

If an LLM is completely incapable of reproducing input text verbatim, yet could become so through targeted ablation (that does not itself incorporate the text in question!), then does it store that text or not?

I'm not sure why I'm even debating this, other than for intellectual curiosity. My opinion isn't actually relevant to anyone. Namely: I think the general shape of how this ought to work is pretty straightforward and obvious, but (1) it does not match current legal reality, and more importantly, (2) it is highly inconvenient for many stakeholders (very much including LLM users). Not to mention that (3) although the general shape is pretty clear in my head, it involves many, many judgement calls such as the ones we've been discussing here, and the general shape of how it ought to work isn't going to help make those calls.

> An image that is lossily compressed, converted to a bitstream, and encoded in DNA is very very different than the input, but if an image can be recovered that is indistinguishable or barely distinguishable from the original, I'd still call that copying and each intermediate step a significant but irrelevant transformation.

Sure, as a broad rule of thumb that works. But the ability of a machine to produce a copyright violation doesn't mean the machine itself, or distributing the machine, is a copyright violation. To take an extreme example, if we take a room full of infinite monkeys, put them on infinite typewriters, and they generate a Harry Potter book, that doesn't mean Harry Potter is stored in the monkey room. If we have a random sound generator that produces random tones from the standard Western musical note palette and it generates the bass line from "Under Pressure", that doesn't mean our random sound generator contains or is a copy of "Under Pressure", even if we encoded all the same information and procedures for generating those individual notes at those durations among the data and procedures we gave the machine.
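
As a purely illustrative aside (the note palette, durations, and riff length below are assumptions I'm making up, not anything taken from a real song), a quick Python sketch shows how a generator can encode every pitch and duration it would ever need and still be astronomically unlikely to emit any one particular bass line, which is the sense in which it doesn't "contain" the work.

```python
# Illustrative sketch only: a uniform random riff generator and the odds of it
# reproducing one specific riff by chance. All constants here are assumptions.
import random

# 12 chromatic note names over two octaves: 24 pitches total.
NOTES = [f"{name}{octave}" for octave in (2, 3)
         for name in ("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")]
DURATIONS = [0.25, 0.5, 1.0, 2.0]  # note lengths, in beats

def random_riff(length: int) -> list[tuple[str, float]]:
    # Draw each (pitch, duration) pair uniformly at random.
    return [(random.choice(NOTES), random.choice(DURATIONS)) for _ in range(length)]

# Probability of reproducing one specific riff of n notes by pure chance:
# each draw matches with probability 1 / (len(NOTES) * len(DURATIONS)).
n = 16
p = (1 / (len(NOTES) * len(DURATIONS))) ** n
print(random_riff(n))
print(f"Chance of landing on a given {n}-note riff: {p:.2e}")  # vanishingly small
```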

> If an LLM is completely incapable of reproducing input text verbatim, yet could become so through targeted ablation (that does not itself incorporate the text in question!), then does it store that text or not?

I would argue not. Just like a Xerox machine doesn't contain the books you make copies of when you use it to make a copy, and HandBrake doesn't contain the DVDs you use when you make a copy there.

I would further argue that copyright infringement is inherently a "human" act. It's sort of encoded in the language we use to talk about it (e.g. "fair use"), but it's also something of an "if a tree falls in the middle of the woods" situation. If an LLM runs in an isolated room in an isolated bunker with no one around and generates verbatim copies of the Linux kernel, that frankly doesn't matter. On the other hand, if a Microsoft employee induces an LLM to produce verbatim copies of the Linux kernel, that does, especially if they did so with the intent to incorporate Linux kernel code into Windows. Not because of the LLM, but because a person made the choice to produce a copy of something they didn't have the right to make a copy of. The method by which they accomplished that copy is less relevant than making the copy at all, and that in turn is less relevant than the intent of making that copy for a purpose which is not allowed by copyright law.

> I'm not sure why I'm even debating this, other than for intellectual curiosity.

Frankly, that's the only reason to debate anything. 99% of the time, you as an individual will never have the power to influence the actual legal decisions made. But an intellectually curious conversation is infinitely more useful, not just to you and me but to other readers, than another retread of the "AI is slop" / "you're just jealous you can't code your way out of a paper bag" arguments that pervade so much discussion around AI. Or, worse yet, another "I used an LLM for a clearly stupid thing and it was stupid" or "I used an LLM to replace all my employees and I'm sure it's going to go great" blog post. For whatever acrimony there might have been in our interchange here, I'm sorry, because this sort of discussion is the only good way to exercise our thoughts on an issue and really test them out ourselves. It's easy to have a knee-jerk opinion. It's harder to support that opinion with a philosophy and reasoning.

For what it's worth, I view the LLM/AI world as the best opportunity we've had in decades to really rethink and scale back/change how we deal with intellectual property: the ever-expanding copyright terms, the sometimes bizarre protections of what seem to be blindingly obvious ideas. The technological age has demonstrated a number of weaknesses in the traditional systems and views. And frankly I think it's also demonstrated that many prior predictions of certain doom if copyright wasn't strictly enforced have been overwrought, and even where they haven't, the actual result has been better for more people. Famously, IBM would have very much preferred to have won the BIOS copyright issue. But I think so much of the modern computer and tech industry owes their very careers to the effects of that decision. It might have been better for IBM if IBM had won, but it's not clear at all that it would have been better for "[promoting] the Progress of Science and useful Arts".

We could live in a world where we recognize that LLMs and AIs are going to fundamentally change how we approach creative works. We could recognize that the intent of "[promoting] the Progress of Science and useful Arts" is still a relevant goal and something we can work to make compatible with the existence of LLMs and AI. To pitch my crazy idea again, we could:

1) Cut the terms of copyright substantially, back down to 10 or 15 years by default.

2) Offer a single extension that doubles that term, but only on the condition that the work is submitted to a central "library of congress" data set.

3) This could be used to produce known-good and clean data sets for AI companies and organizations to train models from, with the protection that any model trained from this data set cannot face copyright infringement claims for works in the data set. Heck, we could even produce common models. This would save massive amounts of power and resources by cutting the need for everyone who wants to be in the AI space to go out and acquire, digitize, and build their own library. The MNIST digits set is effectively the "hello world" data set for anyone learning computer vision; let's do that for all sorts of AI.

4) The data sets and models would be provided for a nominal fee; this fee would be used to pay royalties to people whose works are still under copyright and are in the data sets, proportional to the recency and quantity of the work submitted. A cap would need to be put in place to prevent flooding the data set to game the royalties. These royalties would be part of recognizing the value the original works contributed to the data set, and would act as a further incentive to contribute works to the system and contribute them sooner.

We could build a system like this, or tweak it, or even build something else entirely. But only if we stop trying to cram how we treat AI and LLMs and the consequences of this new technology into a binary "allowed / not allowed" outcome as determined by an aging system that has long needed an overhaul.

So please, continue to debate for intellectual curiosity. I'd rather spend hours reading a truly curious exploration of this than another manifesto about "AI slop."