Comment by wvenable

6 hours ago

> If I use my knowledge to produce code, under a specific license, then you take that code, and reproduce it without the license, you have broken the law.

Correct. But if I read your code, produce a detailed specification of that code, and then give that specification to another team (one that has never seen your code) and they create a similar product then they haven't broken the law.

LLMs reproducing exact content from their training data is a symptom of overfitting and is an error that needs correcting. Memorizing specific training data means that it is not generalizing enough.
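
For a concrete sense of what reproducing exact content looks like, here is a minimal sketch of a memorization probe, in the spirit of published training-data-extraction work: feed a model a prefix it almost certainly saw many times in training, and check whether greedy decoding reproduces the continuation verbatim. The `gpt2` checkpoint and the sample passage are illustrative placeholders, not claims about any particular model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A passage duplicated heavily across web scrapes (placeholder example).
prefix = "We the People of the United States, in Order to form"
known_continuation = " a more perfect Union"

ids = tok(prefix, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8, do_sample=False)  # greedy decoding
completion = tok.decode(out[0][ids.shape[1]:])

# Verbatim reproduction of duplicated training text is the
# overfitting symptom under discussion.
print(completion.startswith(known_continuation))
```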

> and they create a similar product then they haven't broken the law.

That costs significantly more and involves the creation of jobs. I see this as a great outcome. There seems to be a group of people who hold the opposite view on this matter.

> and is an error that needs correcting

It's been known for years. They don't seem interested in doing that or they simply aren't capable. I presume that's because most of the value in their service _is_ the copyright whitewashing.

> Memorizing specific training data means that it is not generalizing enough.

Is that like a knob they can turn or is it something much more fundamental to the technology they've staked trillions on?

  • > That costs significantly more and involves the creation of jobs. I see this as a great outcome.

    I don't see it that way. If whatever you're doing can now be automated, then it's become a bullshit job. It is no longer a benefit to humanity to have a human sit on their ass, stand on their feet, or break their back doing a job that can be automated. As a software developer, it's my job to take the dumb, repetitive stuff that humans do and make it so that humans never have to do that job again.

    If that's a problem for society, it's because society is messed up.

    > It's been known for years. They don't seem interested in doing that or they simply aren't capable.

    I don't find that to be a particularly big problem. Fundamentally, an AI isn't just compressing all human knowledge and decompressing it on demand; it's tweaking parameters in a giant matrix. I can reproduce the lyrics of songs that I've heard, but that doesn't mean there is a literal copy of the song in my brain that you could extract with a well-placed scalpel. It just means I've heard it a bunch of times and the giant matrix in my brain is tuned to be able to spit it out.
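
    To make that analogy concrete, here is a toy sketch (nothing like a production LLM, just the same principle in miniature, with made-up example text): a tiny model trained until it recites a string perfectly, whose parameters nonetheless contain no literal copy of the string's bytes.

    ```python
    import torch
    import torch.nn as nn

    text = "we can reproduce these lyrics from memory"
    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}

    # A tiny position -> character model: everything it "knows" lives in
    # two learned weight matrices, not in stored text.
    model = nn.Sequential(nn.Embedding(len(text), 32), nn.Linear(32, len(chars)))
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    xs = torch.arange(len(text))
    ys = torch.tensor([stoi[c] for c in text])

    for _ in range(300):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(xs), ys)
        loss.backward()
        opt.step()

    # The model can now recite the text...
    recited = "".join(chars[i] for i in model(xs).argmax(dim=1))
    print(recited == text)  # True once the loop above converges

    # ...yet the literal bytes of the text appear nowhere in its parameters;
    # the information is smeared across floating-point weights.
    blob = b"".join(p.detach().numpy().tobytes() for p in model.parameters())
    print(text.encode() in blob)  # False
    ```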

    > Is that like a knob they can turn or is it something much more fundamental to the technology they've staked trillions on?

    In a sense, it's a knob. It's not fundamental to the technology; if a model is reproducing something exactly, that likely means it's over-trained on that data. Exact reproduction is actually bad for the models (it makes them more incorrect, more rigid, and more repetitive), so that is a knob they will turn.
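
    As a rough illustration of the knob, here is a classic minimal example using ridge regression; the penalty `lam` plays the same role that standard regularizers (weight decay, data deduplication, early stopping) play in LLM training. This is a generic textbook sketch, not any lab's actual recipe.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 12)
    y = x + rng.normal(0, 0.2, size=x.shape)  # noisy "training data"
    X = np.vander(x, 10)  # degree-9 features: enough capacity to memorize

    def fit(X, y, lam):
        # Ridge regression: lam is the knob that penalizes extreme weights.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    for lam in (0.0, 1.0):
        w = fit(X, y, lam)
        print(f"lam={lam}: train error {np.mean((X @ w - y) ** 2):.4f}, "
              f"weight magnitude {np.abs(w).sum():.1f}")
    # lam=0.0 drives training error toward zero by memorizing the noise
    # (huge weights); lam=1.0 recalls the training points less exactly
    # but generalizes better, which is the trade described above.
    ```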