Comment by barrucadu

3 months ago

> I don’t think a country’s government can justify no commercial LLMs to its populace

They're not saying no LLMs, they're saying no LLMs using lyrics without a license. OpenAI simply need to pay for a license, or train an LLM without using lyrics.

But lyrics are just one example. Are you saying that training experiments must filter out all substrings from the training input that bear too close a resemblance to a substring of a copyrighted work?

  • Obviously there's a limit: reproducing a single sentence is unlikely to be copyright infringement, simply because there are only so many words in a language. But if reproducing some text would be copyright infringement if a human did it, I don't see why LLM companies should get a free pass.

    If it's really essential that they train their models on song lyrics, or books, or movie scripts, or articles, or whatever, they should pay license fees.

Oi, you got a loisense to read those words and then repeat them back to me when asked?

  • I take it you think copyright shouldn't exist at all, then?

    • That is a separate opinion, but with respect to the question at hand, the utilitarian value of being able to ask a computer "what are the lyrics to x" and having it produce them outweighs whatever small ideological sanctity the music labels assign to being able to gatekeep the written words of a composition to a small blessed few. It's not like ChatGPT is serving up the MP3 file to you. So correct, it is insane to me that mere reproduction of just the lyrics is afforded such weighty copy protection.

      (Vis-à-vis, I take it you write a certified letter to Universal before reproducing Happy Birthday in public? ;) That is actually a far more egregious violation, as it is both a performance of the copyrighted work and in front of an audience, neither of which is the case for the chatbot, yet it is one we all seem to understand to be fair use.)

This obviously applies to all copyrighted works. I could sue OpenAI when it reproduces my source code that I published on the Internet.

They already "filter" the code to prevent that from happening (that is, reproducing exact works). My guess is that it just superficially changes things around so that copyright violations are harder to prove.
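
For concreteness, here is a rough sketch of what "filtering out substrings that resemble a copyrighted work" could look like, using word n-gram overlap. It is purely illustrative: the function names, the 5-word window, and the 0.5 threshold are all assumptions made for this example, and nothing here is claimed to be what OpenAI or anyone else actually runs.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate, protected, n=5):
    """Fraction of the candidate's n-grams that also occur in the protected text."""
    candidate_grams = word_ngrams(candidate, n)
    if not candidate_grams:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = candidate_grams & word_ngrams(protected, n)
    return len(shared) / len(candidate_grams)


def filter_documents(documents, protected_works, n=5, threshold=0.5):
    """Keep only documents whose overlap with every protected work is below threshold.

    The window size n and the threshold are arbitrary illustrative values.
    """
    return [
        doc for doc in documents
        if all(overlap_ratio(doc, work, n) < threshold for work in protected_works)
    ]


if __name__ == "__main__":
    protected = ["the quick brown fox jumps over the lazy dog every single morning"]
    docs = [
        "the quick brown fox jumps over the lazy dog every single morning",
        "an unrelated sentence about training language models on public data",
    ]
    print(filter_documents(docs, protected))
```

Note that an exact n-gram match like this only catches verbatim copying: reword a few terms in each window and the overlap drops to zero, which is the "superficially changing things around" problem in miniature.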