
Comment by _heimdall

22 days ago

I don't think the concern is specifically about training computer chips on copyrighted content.

If you are going to use human brain cells to memorize protected content and sell it as a product, that's still an issue based on current copyright laws.

> If you are going to use human brain cells to memorize protected content and sell it as a product, that's still an issue based on current copyright laws.

And yet, that's what most billable hours at McKinsey, BCG, and KPMG are for. Those consultants memorized copyrighted material so your executives didn't have to.

It's very difficult to explain how GPT is not consulting.

  • The question there still comes down to what they did with the memorized content. There's nothing wrong with memorizing copyrighted content; the legal problem is reselling it without paying royalties, under contract, to the copyright owner.

    • The problem with LLMs is the techbro crowd pretending they are like thinking humans (hence all these analogies with us memorizing things) when it comes to rights like access to information and copyright abuse, but not thinking humans when it comes to using LLMs themselves.

      You’d think any logical person would believe only one or the other is true, but big tech has many people believing this paradox because the industry depends on it. As soon as the paradox collapses, the industry is revealed to be based on either IP theft or slavery.

      There is no problem to argue about, really: laws and basic rights and freedoms exist for humans. If %thing% (be it made of chips or brain cells) is not considered human, then laws and rights apply to the humans who operate it; if the thing is considered human, then it itself has human rights to be reckoned with.

Once again...LLMs are not massive archives of data.

You would never want to use an LLM to archive your writings or documents. They are incredibly bad at this.

  • They were never designed to be archives, though; of course they're bad at something that not only was not a goal but is the opposite of a primary design factor.

    LLMs are massive, lossy compressed datasets. They were designed to store the gist of effectively all digital language content humans ever created so an inference engine could use that data space to predict what a person might say to a prompt.

    They were never designed to regurgitate exact copies of the original sources; just use your favorite zip algorithm for that (see the rough sketch below).

    The question would be how closely an LLM can regurgitate an answer before running into copyright issues, and how the original training dataset was obtained.
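
    A minimal sketch of the lossless-vs-lossy contrast, assuming only Python's standard zlib module; the sample passage and the LLM comparison in the comments are purely illustrative, not a claim about any particular model.

    ```python
    import zlib

    # Hypothetical stand-in for a copyrighted passage.
    original = b"Imagine this is a paragraph from a copyrighted novel. " * 100

    # Lossless ("zip-style") compression: the exact original bytes always come back.
    compressed = zlib.compress(original)
    assert zlib.decompress(compressed) == original
    print(f"lossless: {len(original)} bytes -> {len(compressed)} bytes, exact round-trip")

    # An LLM offers no such guarantee: prompting it for the same passage draws on
    # a lossy statistical summary of its training data, so you get an approximation
    # that may or may not be close enough to the original to raise copyright questions.
    ```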