Comment by mlinsey

3 days ago

Certainly if a human wrote code that solved this problem, and a second human copied and tweaked it slightly for their use case, we would have a word for them.

Would we use the same word if two different humans wrote code that solved two different problems, but one part of each problem was somewhat analogous to a different aspect of a third human's problem, and the third human took inspiration from those parts of both solutions to create code that solved a third problem?

What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?

I think "plagiarism", "inspiration", and just "learning from" fall on some continuous spectrum. There are clear differences when you zoom out, but they are in degree, and it's hard to set a hard boundary. The key is just to make sure we have laws and norms that provide sufficient incentive for new ideas to continue to be created.

Ask for something like "a first-person shooter using software rendering", then search GitHub for the names of the rendering functions. Using Copilot, I found code simply lifted from implementations of Doom, except that "int" was replaced with "int32_t" and similar.
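
A minimal sketch of that kind of check, assuming Python with the requests library and a GitHub token; R_DrawColumn is a real function name from Doom's renderer, but the exact query and result handling here are just illustrative:

```python
# Hypothetical check: search GitHub's code-search API for a distinctive
# function name that appeared in generated code, to see where it may have
# been lifted from. Assumes a personal access token in GITHUB_TOKEN.
import os
import requests

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "R_DrawColumn language:c"},
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["items"][:5]:
    print(item["repository"]["full_name"], item["path"])
```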

It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.

  • And this is just the stuff you notice.

    1) Verbatim copying is first-order plagiarism.

    2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and rephrasing each one in your own words. Yes, it might fool automated checkers, but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.
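
    As a toy illustration of 2a (the synonym table and function are invented for this comment, not anyone's real tool):

    ```python
    # Toy "second-order plagiarism": same structure, same information,
    # different surface words. The synonym table is a made-up stub.
    SYNONYMS = {"rapid": "fast", "large": "big", "method": "approach"}

    def reword(paragraph: str) -> str:
        # Swap words for synonyms; structure and content are copied unchanged.
        return " ".join(SYNONYMS.get(w.lower(), w) for w in paragraph.split())

    print(reword("A rapid method for large inputs"))  # A fast approach for big inputs
    ```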

    Before LLMs, almost nobody did this: the legal cover it might buy was not worth the amount of work. Now the tradeoff has flipped. But LLMs can do "better":

    2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take one chapter from each, and order them in a coherent manner. Then make it more granular: find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.
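
    A hypothetical sketch of that splicing, standard library only; the chunking and the similarity score are stand-ins for whatever a real pipeline would use:

    ```python
    # Hypothetical "2b" splicing: assemble a "new" book from whichever chunks
    # of existing books best match each point of an outline.
    from difflib import SequenceMatcher

    def chunks(text: str, size: int = 3) -> list[str]:
        # Overlapping runs of `size` sentences from one source.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return [". ".join(sentences[i:i + size])
                for i in range(max(1, len(sentences) - size + 1))]

    def splice(outline: list[str], sources: list[str]) -> str:
        # For each outline point, lift the most similar source chunk verbatim.
        pool = [c for src in sources for c in chunks(src)]
        return "\n\n".join(
            max(pool, key=lambda c: SequenceMatcher(None, point, c).ratio())
            for point in outline
        )
    ```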

    The trick here is that doing this by hand is more work than just writing your own book, so nobody did it, and copyright law does not really address it well. But with LLMs it can be automated: you can literally instruct an LLM to do this, and it will do it more cheaply than any human could. However, how LLMs work internally is yet different:

    n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.

    If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work has become a mechanical transformation of input data.

    The point is that this ability to detect and reproduce patterns is an impressive innovation, but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that, yet all of the reward goes to them.

    Not to mention that LLMs completely defeat the purpose of the (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation such that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.

    [0]: GitHub alone has 100 million users (https://expandedramblings.com/index.php/github-statistics/), and we have reason to believe all of their data was used in training.

    • If a human did 2a or 2b, we would consider it a larger infraction than (1), because it shows intent to obfuscate the origins.

      As for your "free software is dead" argument: I think it is worse than that. It takes away the one payment that free software authors do get: recognition. If a commercial entity can take the code, obfuscate it, and pass it off as its own copyrighted work, then embrace and extend it, that is the worst possible outcome.


    • You make several good points, and I appreciate that they appear well thought out.

      > What previously took creative human work has become a mechanical transformation of input data.

      At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?

      > then free software is dead. No more freedom to inspect and modify.

      It seems to me that depends on the ideological framing. Consider a (still entirely hypothetical) world where anyone can receive approximately any software they wish with little more than a Q&A session with an expert AI agent. Rather than free software being dead, such a scenario would appear to obviate the vast majority of needs that free software sets out to serve in the first place.

      It seems a bit like worrying that free access to a comprehensive public transportation service would kill off a ride sharing service. It probably would, and the end result would also probably be a net benefit to humanity.


  • > It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.

    Perfectly embodies the AI "startup" mentality. Nice.. /s

The key difference between plagiarism and building on someone's work is whether you say "this is based on code by linsey at github.com/socialnorms" or "here, let me write that for you."

  • but as mlinsey suggests, what if it's influenced in small, indirect ways by 1000 different people, kind of like the way every 'original' idea from trained professionals is? There's a spectrum, and it's inaccurate to claim that Claude's responses are comparable to adapting one individual's work for another use case - that's not how LLMs operate on open-ended tasks, although they can be instructed to do that and produce reasonable-looking output.

    Programmers are not expected to add an addendum to every file listing all the books, articles, and conversations they've had that have influenced the particular code solution. LLMs are trained on far more sources that influence their code suggestions, but it seems like we actually want a higher standard of attribution because they (arguably) are incapable of original thought.

    • It's not uncommon, in a well-written code base, to see functions or algorithms documented with a note on where they came from.

      This isn't just giving credit; it's valuable documentation.

      If you later find a bug in such a function or want to modify it, the original source might not have the bug, might have already fixed it, or might have additional functionality that the first copy didn't need but is useful when you copy the code to a third location.
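
      For example, a minimal illustration of that kind of provenance note (the Python port is hypothetical, but clp2 is a real routine from Warren's "Hacker's Delight"):

      ```python
      def next_pow2(x: int) -> int:
          # Source: port of the clp2 bit-twiddling routine in Warren,
          # "Hacker's Delight", ch. 3. Noted so a later reader can check the
          # original for the 64-bit variant or any published corrections.
          # Assumes 1 <= x <= 2**32.
          x -= 1
          x |= x >> 1
          x |= x >> 2
          x |= x >> 4
          x |= x >> 8
          x |= x >> 16
          return x + 1
      ```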


    • If the problem you ask it to solve has only one or a few examples, or if there are many cases of people copy-pasting the solution, LLMs can and will produce code that would be called plagiarism if a human did it.


  • Do you have a source for that being the key difference? Where did you learn your words? I don't see the names of your teachers cited here. The English language has existed for a while; why aren't you giving a citation every time you use a word that already exists in a lexicon somewhere? We have a name for people who don't coin their own words for everything and instead rip off words that others painstakingly evolved over millennia of history. Find your own graphemes.

    • What a profoundly bad-faith argument. We all understand that singular words are public domain; they belong to everyone. Yet when you arrange them in a specific pattern, of which there are infinite possibilities, you create something unique. When someone copies that arrangement wholesale and claims they were the first, that's what we refer to as plagiarism.

      https://www.youtube.com/watch?v=K9huNI5sBd8


In the case of LLMs, thanks to RAG, it's often not just learning but nearly direct, real-time plagiarism from concrete sources.

  • Isn't RAG used for your code rather than other people's code? If I ask it to implement some algorithm, I'd be very surprised if RAG was involved.

  • RAG and LLMs are not the same thing, but 'Agents' incorporate both.

    Maybe we could resolve the conundrum the OP raises by requiring 'agents' to give credit for things they RAG or pull off the web, along the lines of the sketch below?
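
    A minimal sketch of what that could look like, with provenance kept attached to every retrieved snippet; the corpus entries, the similarity score, and the elided model call are all placeholder assumptions:

    ```python
    # Sketch: retrieval that carries source URLs through to the final answer,
    # so RAGed material can be credited. Corpus entries are hypothetical.
    from difflib import SequenceMatcher

    CORPUS = [
        ("github.com/alice/raycaster", "fn cast_ray(...) { ... }"),
        ("github.com/bob/soft-render", "fn draw_column(...) { ... }"),
    ]

    def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
        # Rank snippets by crude text similarity to the query.
        return sorted(
            CORPUS,
            key=lambda e: SequenceMatcher(None, query, e[1]).ratio(),
            reverse=True,
        )[:k]

    def answer_with_credit(query: str) -> str:
        hits = retrieve(query)
        # ... pass the retrieved snippets to the model as context here ...
        credits = ", ".join(url for url, _ in hits)
        return f"<model output>\n\nSources: {credits}"
    ```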

    It still doesn't resolve the 'inherent learning' problem.

    It's reasonable to suggest that if 'one person did it, we should give credit' - at least in some cases, and also reasonable that if 1K people have done similar things and the AI learns from that, well, I don't think credit is something that should apply.

    But a couple of considerations:

    - It may not be that common for an LLM to 'see one thing one time' and then produce such an accurate rendition of the solution. It helps, but LLMs tend not to 'learn' things that way.

    - Some people might consider this the OSS dream: any code that's public is effectively in the public domain. We don't need to 'give credit' to someone for solving something relatively arbitrary; or, if they care about credit, we can have a separate mechanism for that - they can put it on GitHub or even Wikipedia - and then 'who thought of it first' becomes a separate consideration. In terms of engineering application, though, a credit requirement would be a bit of a detractor.

      > if 1K people have done similar things and the AI learns from that, well, I don't think credit is something that should apply.

      I think it should.

      Sure, if you make a small amount of money and divide it among the 1000 people who deserve credit due to their work being used to create ("train") the model, it might be too small to bother.

      But if actual AGI is achieved, then it has nearly infinite value. If said AGI is built on top of the work of the 1000 people, then almost infinity divided by 1000 is still a lot of money.

      Of course, the real numbers are way larger: LLMs were trained on the work of at least 100M and perhaps over a billion people. But the value they provide over a long enough timespan is also claimed to be astronomical (as evidenced by the valuations of those companies). It's not just their employees who deserve a cut but everyone whose work was used to train them.

      > Some people might consider this the OSS dream

      I see the opposite. Code that was public but protected by copyleft can now be reused in private/proprietary software. All you need to do is push it through enough matmuls and some nonlinearities.


> we have laws and norms that provide sufficient incentive for new ideas to continue to be created

Indeed, and up until the advent of 'AI' we did. But that incentive is being killed right now and I don't see any viable replacement on the horizon.

Thanks for writing this - love the way you explain this point of view. I wish people would consider this angle more.

> What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?

What if it was just a single person? I take it you didn't read any of the code in the OCaml vibe PR that was posted a bit ago? The one where Claude copied not just implementation specifics, but even the copyright headers from a named, specific person.

It's clear that you can have no idea if the magic black box is copying from a single source, or from many.

So your comment boils down to: plagiarism is fine as long as I don't have to think about it. Are you really arguing that's OK?

  • > So your comment boils down to: plagiarism is fine as long as I don't have to think about it.

    It is actually worse: plagiarism is fine if I'm shielded from such claims by using a digital mixer. When criminals use crypto tumblers to hide their involvement we tend to see that as proof of intent, not as absolution.

    LLMs are copyright tumblers.

    https://en.wikipedia.org/wiki/Cryptocurrency_tumbler