Comment by _heimdall

20 days ago

They were never designed to be archives though, so of course they're bad at something that not only wasn't a goal but is the opposite of a primary design factor.

LLMs are massive, lossily compressed datasets. They were designed to store the gist of effectively all the digital language content humans have ever created, so that an inference engine can use that compressed space to predict what a person might say in response to a prompt.

They were never designed to regurgitate exact copies of the original sources; just use your favorite zip algorithm for that.
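
A minimal sketch of the contrast being drawn here, using Python's standard zlib for the lossless case; the LLM side is only described in comments, since there's no single API to point to:

    import zlib

    original = b"the exact bytes of some source document"

    # Lossless compression: decompress() returns the original bytes exactly.
    compressed = zlib.compress(original)
    assert zlib.decompress(compressed) == original

    # An LLM is the lossy case: training squeezes terabytes of text into a
    # fixed set of weights, and generation samples likely next tokens from
    # that compressed representation rather than replaying stored bytes,
    # so the output is a plausible reconstruction, not a guaranteed copy.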

The real questions are how closely an LLM can regurgitate the original before running into copyright issues, and how the original training dataset was obtained.