Comment by themafia
5 hours ago
> I can already hear the cries of protest from other engineers who (like me) are clutching onto their hard-won knowledge.
You mean the knowledge that Claude has stolen from all of us and regurgitated into your projects without any copyright attributions?
> But I see a lot of my fellow developers burying their heads in the sand
That feeling is mutual.
> You mean the knowledge that Claude has stolen from all of us and regurgitated into your projects without any copyright attributions?
You can't, and shouldn't be able to, copyright and hoard "knowledge".
I did not suggest that; however, the law is clear. If I use my knowledge to produce code under a specific license, and you then take that code and reproduce it without the license, you have broken the law.
You can twist this around as much as you like, but there are several studies showing that LLMs can and will happily reproduce content from their training data.
> If I use my knowledge to produce code, under a specific license, then you take that code, and reproduce it without the license, you have broken the law.
Correct. But if I read your code, produce a detailed specification of it, and then give that specification to another team (one that has never seen your code) and they create a similar product, then they haven't broken the law.
LLMs reproducing exact content from their training data is a symptom of overfitting and is an error that needs correcting. Memorizing specific training data means the model is not generalizing enough.
We did the same as developers before Claude: we would copy-paste from Stack Overflow. Now this process is heavily automated.
> Now this process is heavily automated.
And comes with a price tag paid to people who neither own nor generated that content. You don't think that shifts the ethical boundaries _significantly_?
I don't. The general trend in US rulings is that courts have found that if the material was obtained legally, then training on it can be fair use. My understanding is that getting LLMs to regurgitate anything significant generally requires very specific prompting.
I would very much like someone to give me the magic reproduction triple: a model trained on your code, a prompt you gave it to produce a program, and its output showing copyright infringement on the training material. Specific examples are useful; my hypothesis is that this won't be possible with a "normal" prompt of the kind in general use, but will instead require a prompt containing a lot of directly quoted content from the training material that then asks for more of the same. This was a problem for the NYT when they claimed OpenAI reproduced the content of their articles: they achieved it by prompting with large, unmodified sections of an article, after which the LLM would spit out a handful of sentences. In their briefing to the court, they neglected to include their prompts for this reason. I think this is significant because it relates to what is really happening, rather than what people imagine is happening.
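To make that concrete, here's roughly the kind of test I have in mind, sketched in Python. The model (gpt2), the reference snippet, and the prompts are all placeholders I made up; the point is just to compare a "normal" prompt against one seeded with quoted source text and measure the verbatim overlap with the allegedly copied material.

```python
# Rough sketch of a regurgitation test: measure how much verbatim text a model
# reproduces from a reference snippet, given a "normal" prompt vs. a prompt
# seeded with quoted source text. gpt2, the snippet, and the prompts are
# placeholders, not claims about any particular model's training data.
from difflib import SequenceMatcher

from transformers import pipeline

reference = "the licensed text you suspect was memorized goes here"
normal_prompt = "Write a short paragraph about software licensing."
seeded_prompt = reference[:200]  # quoting the source, NYT-exhibit style

generate = pipeline("text-generation", model="gpt2")


def longest_verbatim_overlap(output: str, source: str) -> int:
    """Length, in characters, of the longest exact substring shared by both."""
    matcher = SequenceMatcher(None, output, source, autojunk=False)
    return matcher.find_longest_match(0, len(output), 0, len(source)).size


for label, prompt in [("normal", normal_prompt), ("seeded", seeded_prompt)]:
    completion = generate(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    completion = completion[len(prompt):]  # drop the prompt so quoted text doesn't count
    print(label, longest_verbatim_overlap(completion, reference))
```

If regurgitation is as easy as claimed, the "normal" run should show long verbatim spans too, not just the seeded one. That's the evidence I'd want to see.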
But I guess we'll get to see from the NYT trial, since OpenAI is retaining all user prompts and outputs and providing them to the NYT to sift through. So the ground truth exists; I'm sure they'll be excited to cite all the cases where people were circumventing their paywall with OpenAI.
...from answers that were publicly shared without a license. It's not the same thing, even though everyone LOVES to make this argument.
Also: over the past 20 years, I could count on one hand the number of times I have been able to get away with outright copy/paste from SO.
In a prior job, I had to scan a 2M+ line codebase for software license violations to support the sale of a unit to another corporation. One class of violation was using SO snippets, because they are licensed under CC and not compatible with the distribution model the new company was planning. Many weeks of work to track them all down.
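For what it's worth, the crude first pass on that kind of audit is just sweeping the tree for Stack Overflow links in comments, something like the sketch below (the root path and extension list are made up). The weeks went into everything this misses: pasted snippets that were never attributed.

```python
# Crude first pass of an SO-snippet audit: flag files that reference
# stackoverflow.com. Root path and extension list are illustrative; the
# unattributed pastes still need similarity scanning or manual review.
import re
from pathlib import Path

SO_LINK = re.compile(r"https?://(?:www\.)?stackoverflow\.com/\S+")
EXTENSIONS = {".py", ".js", ".java", ".c", ".cpp", ".go"}


def find_so_references(root):
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for match in SO_LINK.finditer(line):
                yield path, lineno, match.group(0)


if __name__ == "__main__":
    for path, lineno, url in find_so_references("src"):
        print(f"{path}:{lineno}: {url}")
```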
Stack Overflow code has a license (not per post, but a blanket one depending on the year - https://stackoverflow.com/help/licensing - it's mostly CC BY-SA). I've written corporate policies that emphasize that you can learn from SO answers, but (as you point out) they basically never fit exactly - and that you should include a link to the original, so when the next Ubuntu LTS breaks your clever hack, we can see if someone has already posted a fix :-)
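Concretely, the attribution convention in those policies looks something like this (the answer URL is a placeholder, and the snippet is just an illustrative example, not from any particular answer):

```python
# Adapted from https://stackoverflow.com/a/1234567 (placeholder URL; the real
# answer is CC BY-SA). Keep the link so that when the next Ubuntu LTS breaks
# this, the next maintainer can check whether the thread already has a fix.
import os


def which(program):
    """Return the full path to `program` if it is on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```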