Comment by jacquesm

2 months ago

Claude didn't write that code. Someone else did and Claude took that code without credit to the original author(s), adapted it to your use case and then presented it as its own creation to you and you accepted this. If a human did this we probably would have a word for them.

131 comments

jacquesm

mlinsey 2 months ago

Certainly if a human wrote code that solved this problem, and a second human copied and tweaked it slightly for their use case, we would have a word for them.

Would we use the same word if two different humans wrote code that solved two different problems, but one part of each problem was somewhat analogous to a different aspect of a third human's problem, and the third human took inspiration from those parts of both solutions to create code that solved a third problem?

What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?

I think "plagiarism", "inspiration", and just "learning from" fall on some continuous spectrum. There are clear differences when you zoom out, but they are in degree, and it's hard to set a hard boundary. The key is just to make sure we have laws and norms that provide sufficient incentive for new ideas to continue to be created.

nitwit005 2 months ago
Ask for something like "a first person shooter using software rendering", and search github for the function names for the rendering functions. Using Copilot I found code simply lifted from implementations of Doom, except that "int" was replaced with "int32_t" and similar.
It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.
- martin-t 2 months ago
  
  And this is just the stuff you notice.
  1) Verbatin copy is first-order plagiarism.
  2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and for each one of them, rephrasing it in your own words. Yes, it might fool automated checkers but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.
  Before LLMs almost nobody did this because the chance that it would help in a lawsuit vs the amount of work was not a good tradeoff. Now it is. But LLMs can do "better":
  2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take 1 chapter from each and order them in a coherent manner. Make it more granular. Find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.
  The trick here is that doing this by hand is more work than just writing your own book. So nobody did it and copyright law does not really address this well. But with LLMs, it can be automated. You can literally instruct an LLM to do this and it will do it cheaper than any human could. However, how LLMs work internally is yet different:
  n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.
  If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work now became a mechanical transformation of input data.
  The point is this ability to detect and reproduce patterns is an impressive innovation but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that. Yet all of the reward goes to them.
  Not to mention LLMs completely defear the purpose of (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.
  [0]: Github alone has 100 million users ( https://expandedramblings.com/index.php/github-statistics/ ) and we have reason to believe all of their data was used in training.
  
  12 replies →
- fransje26 2 months ago
  
  > It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.
  Perfectly embodies the AI "startup" mentality. Nice.. /s
whatshisface 2 months ago
They key difference between plagarism and building on someone's work is whether you say, "this based on code by linsey at github.com/socialnorms" or "here, let me write that for you."
- CognitiveLens 2 months ago
  
  but as mlinsey suggests, what if it's influenced in small, indirect ways by 1000 different people, kind of like the way every 'original' idea from trained professionals is? There's a spectrum, and it's inaccurate to claim that Claude's responses are comparable to adapting one individual's work for another use case - that's not how LLMs operate on open-ended tasks, although they can be instructed to do that and produce reasonable-looking output.
  Programmers are not expected to add an addendum to every file listing all the books, articles, and conversations they've had that have influenced the particular code solution. LLMs are trained on far more sources that influence their code suggestions, but it seems like we actually want a higher standard of attribution because they (arguably) are incapable of original thought.
  
  4 replies →
- ineedasername 2 months ago
  
  Do you have a source for that being the key difference? Where did you learn your words, I don’t see the names of your teachers cited here. The English language has existed a while, why aren’t you giving a citation every time you use a word that already exists in a lexicon somewhere? We have a name for people who don’t coin their own words for everything and rip off the words that other painstakingly evolved over a millennia of history. Find your own graphemes.
  
  9 replies →
nextos 2 months ago
In case of LLMs, due to RAG, very often it's not just learning but almost direct real-time plagiarism from concrete sources.
- doix 2 months ago
  
  Isn't RAG used for your code rather than other people's code? If I ask it to implement some algorithm, I'd be very surprised if RAG was involved.
- sholain 2 months ago
  
  RAG and LLMs are not the same thing, but 'Agents' incorporate both.
  Maybe we could resolve the bit of a conundrum by the op in requiring 'agents' to give credit for things if they did rag them or pull them off the web?
  It still doesn't resolve the 'inherent learning' problem.
  It's reasonable to suggest that if 'one person did it, we should give credit' - at least in some cases, and also reasonable that if 1K people have done similar things ad the AI learns from that, well, I don't think credit is something that should apply.
  But a couple of considerations:
  - It may not be that common for an LLM to 'see one thing one time' and then have such an accurate assessment of the solution. It helps, but LLMs tend not to 'learn' things that way.
  - Some people might consider this the OSS dream - any code that's public is public and it's in the public domain. We don't need to 'give credit' to someone because they solved something relatively arbitrary - or - if they are concerned with that, then we can have a separate mechanism for that, aka they can put it on Github or Wikipedia even, and then we can worry about 'who thought of it first' as a separate consideration. But in terms of Engineering application, that would be a bit of a detractor.
  
  5 replies →
jacquesm 2 months ago

> we have laws and norms that provide sufficient incentive for new ideas to continue to be created
Indeed, and up until the advent of 'AI' we did. But that incentive is being killed right now and I don't see any viable replacement on the horizon.
geniium 2 months ago

Thanks for writing this - love the way u explain the pov. I wish people would consider this angle more
grayhatter 2 months ago
> What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?
What if it was just a single person? I take it you didn't read any of the code in the ocaml vibe pr that was posted a bit ago? The one where Claude copied non just implementation specifics, but even the copyright headers from a named, specific person.
It's clear that you can have no idea if the magic black box is copying from a single source, or from many.
So your comment boils down to; plagiarism is fine as long as I don't have to think about it. Are you really arguing that's ok?
- jacquesm 2 months ago
  
  > So your comment boils down to; plagiarism is fine as long as I don't have to think about it.
  It is actually worse: plagiarism is fine if I'm shielded from such claims by using a digital mixer. When criminals use crypto tumblers to hide their involvement we tend to see that as proof of intent, not as absolution.
  LLMs are copyright tumblers.
  https://en.wikipedia.org/wiki/Cryptocurrency_tumbler

bsaul 2 months ago

That's an interesting hypothesis : that LLM are fundamentally unable to produce original code.

Do you have papers to back this up ? That was also my reaction when i saw some really crazy accurate comments on some vibe coded piece of code, but i couldn't prove it, and thinking about it now i think my intuition was wrong (ie : LLMs do produce original complex code).

jacquesm 2 months ago
We can solve that question in an intuitive way: if human input is not what is driving the output then it would be sufficient to present it with a fraction of the current inputs, say everything up to 1970 and have it generate all of the input data from 1970 onwards as output.
If that does not work then the moment you introduce AI you cap their capabilities unless humans continue to create original works to feed the AI. The conclusion - to me, at least - is that these pieces of software regurgitate their inputs, they are effectively whitewashing plagiarism, or, alternatively, their ability to generate new content is capped by some arbitrary limit relative to the inputs.
- measurablefunc 2 months ago
  
  This is known as the data processing inequality. Non-invertible functions can not create more information than what is available in their inputs: https://blog.blackhc.net/2023/08/sdpi_fsvi/. Whatever arithmetic operations are involved in laundering the inputs by stripping original sources & references can not lead to novelty that wasn't already available in some combination of the inputs.
  Neural networks can at best uncover latent correlations that were already available in the inputs. Expecting anything more is basically just wishful thinking.
  
  11 replies →
- andsoitis 2 months ago
  
  I like your test. Should we also apply to specific humans?
  We all stand on the shoulders of giants and learn by looking at others’ solutions.
  
  5 replies →
- andrepd 2 months ago
  
  Excellent observation.
- ninetyninenine 2 months ago
  
  [dead]
- bfffbgfdcb 2 months ago
  
  [flagged]
  
  1 reply →
martin-t 2 months ago

The whole "reproduces training data vebatim" is a red herring.
It reproduces _patterns from the training data_, sometimes including verbatim phrases.
The work (to discover those patterns, to figure out what works and what does not, to debug some obscure heisenbug and write a blog post about it, ...) was done by humans. Those humans should be compensated for their work, not owners of mega-corporations who found a loophole in copyright.
fpoling 2 months ago
Pick up a book about programming from seventies or eighties that was unlikely to be scanned and feed into LLM. Take a task from it and ask LLM to write a program from it that even a student can solve within 10 minutes. If the problem was not really published before, LLM fails spectacularly.
- crawshaw 2 months ago
  
  This does not appear to be true. Six months ago I created a small programming language. I had LLMs write hundreds of small programs in the language, using the parser, interpreter, and my spec as a guide for the language. The vast majority of these programs were either very close or exactly what I wanted. No prior source existed for the programming language because I created it whole cloth days earlier.
  
  5 replies →
- handoflixue 2 months ago
  
  It's telling that you can't actually provide a single concrete example - because, of course, anyone skilled with LLMs would be able to trivially solve any such example within 10 minutes.
  Perhaps the occasional program that relies heavily on precise visual alignment will fail - but I dare say if we give the LLM the same grace we'd give a visually impaired designer, it can do exactly as well.
  
  1 reply →
- anjel 2 months ago
  
  Sometimes its generated, and many times its not. Trivial to denote, but its been deemed non of your business.
- ahepp 2 months ago
  
  You've done this? I would love to read more about it
_heimdall 2 months ago
I have a very anecdotal, but interesting, counterexample.
I recently asked Gemini 3 Pro to create an RSS feed reader type of experience by using XSLT to style and layout an OPML file. I specifically wanted it to use a server-side proxy for CORS, pass through caching headers in the proxy to leverage standard HTTP caching, and I needed all feed entries for any feed in the OPML to be combined into a single chronological feed.
It initially told multiple times that it wasn't possible (it also reminded me that Google is getting rid of XSLT). Regardless, after reiterating that it is possible multiple times it finally decided to make a temporary POC. That POC worked on the first try, with only one follow up to standardize date formatting with support for Atom and RSS.
I obviously can't say the code was novel, though I would be a bit surprised if it trained on that task enough for it to remember roughly the full implementation and still claimed it was impossible.
- jacquesm 2 months ago
  
  Why do you believe that to be a counterexample? In fragmentary form all of these elements must have been present in the input, the question is really how large the largest re-usable fragment was and whether or not barring some transformations you could trace it back to the original. I've done some experiments along the same lines to see what it spits out and what I noticed is that from example to example the programming style changed drastically, to the point that I suspect that it was mimicking even the style and not just the substance of the input data, and this over chunks of code long enough that it would definitely clear the bar for plagiarism.
  
  2 replies →
checkmatez 2 months ago

> that LLM are fundamentally unable to produce original code.
What about humans? Are humans capable of producing completely original code or ideas or thoughts?
As the saying goes, if you want to create something from scratch, you have to start by inventing the universe.
Human mind works by noticing patterns and applying them in different contexts.
checker659 2 months ago

I think the burden of proof is on the people making the original claim (that LLMs are indeed spitting out original code).
moron4hire 2 months ago
No, the thing needing proof is the novel idea: that LLMs can produce original code.
- marcus_holmes 2 months ago
  
  LLMs can definitely produce original other stuff: ask it to create an original poem and on an extremely specific niche subject and it will do so. You can specify the niche subject to the point where it is incredibly unlikely that there is a poem on that subject in its training data, and it will still produce an original poem on that subject [0]. The well-known "otter using wifi on a plane" series of images [1] is another example: this is not in the training data (well, it is now, because well-known, but you get the idea).
  Is there something unique about code, that is different from language (or images), that would make it impossible for an LLM to produce original code? I don't believe so, but I'm willing to be convinced.
  I think this switches the burden of proof: we know LLMs can produce original content in other contexts. Why would they not be able to create original code?
  [0] Ever curious, I tested this assumption. I got Claude to write an original limerick about goats oiling their beards with olive oil, which was the first reasonable thing I could think of as a suitably niche subject. I googled the result and could not find anything close to it. I then asked it to produce another limerick on the same subject, and it produced a different limerick, so obviously not just repeating training data.
  [1] https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
  
  2 replies →
- ChromaticPanic 2 months ago
  
  This just reeks of a lack of understanding of how transformers work. Unlike Markov Chains that can only regurgitate known sequences, transformers can actually make new combinations.
- handoflixue 2 months ago
  
  What's your proof that the average college student can produce original code? I'm reasonably certain I can get an LLM to write something that will pass any test that the average college student can, as far as that goes.
  
  1 reply →

ekropotin 2 months ago

> If a human did this we probably would have a word for them.

What do you mean? The programmers work is literally combining the existing patterns into solutions for problems.

Mtinie 2 months ago

> If a human did this we probably would have a word for them.

I don’t think it’s fair to call someone who used Stack Overflow to find a similar answer with samples of code to copy to their project an asshole.

jacquesm 2 months ago
Who brought Stack Overflow up? Stack Overflow does not magically generate code, someone has to actually provide it first.
- Mtinie 2 months ago
  
  I generally agree with your underlying point concerning attribution and intellectual property ownership but your follow-up comment reframes your initial statement: LLMs generate recombinations of code from code created by humans, without giving credit.
  Stack Overflow offers access to other peoples’ work, and developers combined those snippets and patterns into their own projects. I suspect attribution is low.
  
  4 replies →
bluedino 2 months ago

It has been for the last 15 years.
sublinear 2 months ago
Using stack overflow recklessly is definitely asshole behavior.
- Mtinie 2 months ago
  
  Recklessly is a strong word. I’ll give you the benefit of the doubt and assume your comment in good faith.
  How do you describe the “reckless” use of information?

Aeolun 2 months ago

Software engineer? You think I cite all the code I’ve ever seen before when I reproduce it? That I even remember where it comes from?

tovej 2 months ago

You don't?
If you reproduce something, usually you have to check the earlier implementation for it and copy it over. This would inevitably require you to look at the license and author of said code.
Assuming of course, you're talking about nontrivial functionality, because obviously we're not talking about trivial one-liners etc.

ineedasername 2 months ago

>we probably would have a word for them

Student? Good learner? Pretty much what everyone does can be boiled down to reading lots of other code that’s been written and adapting it to a use case. Sure, to some extent models are regurgitating memorized information, but for many tasks they’re regurgitating a learned method of doing something and backfilling the specifics as needed— the memorization has been generalized.

raincole 2 months ago

This is why ragebait is chosen as the word of 2025.

> took that code without credit to the original author(s), adapted it to your use case

Aka software engineering.

fooker 2 months ago

> If a human did this we probably would have a word for them.

Humans do this all the time.

FanaHOVA 2 months ago

Are you saying that every piece of code you have ever written contains a full source list of every piece of code you previously read to learn specific languages, patterns, etc?

Or are you saying that every piece of code you ever wrote was 100% original and not adapted from any previous codebase you ever worked in or any book / reference you ever read?

pests 2 months ago
While I generally agree with you, this "LLM is a human" comparisons really are tiresome I feel. It hasn't been proven and I don't know how many other legal issued could have solved if adding "like a human" made it okay. Google v Oracle? "oh, you've never learned an API??!?" or take the original Google Books controversy - "its reading books and memorizing them, like humans can". I do agree its different but I don't like this line of argument at all.
- FanaHOVA 2 months ago
  
  I agree, that's why I was trying to point out that saying "if a person did that we'd have a word for them" is useless. They are not people, and people don't behave like that anyway. It adds nothing to the discussion.
jacquesm 2 months ago
What's with the bad takes in this thread. That's two strawmen in one comment, it's getting a bit crowded.
- DangitBobby 2 months ago
  
  Or the original point doesn't actually hold up to basic scrutiny and is indistinguishable from straw itself.
  
  8 replies →

martin-t 2 months ago

Programmers are willingly blind to this, at least until it's their code being stolen or they lose their job.

_LLMs are lossily compressed archives of stolen code_.

Trying to achieve AI through compression is nothing new.[0] The key innovation[1] is that the model[2] does not output only the first order input data but also the higher order patterns from the input data.

That is certainly one component of intelligence but we need to recognize that the tech companies didn't build AI, they build a compression algorithm which, combined with the stolen input text, can reproduce the input data and its patterns in an intelligent-looking way.

[0]: http://prize.hutter1.net/

[1]: Oh, god, this phrase is already triggering my generated-by-LLM senses.

[2]: Model of what? Of the stolen text. If 99.9999% of the work to achieve AI wasn't done by people whose work was stolen, they wouldn't be called models.

FeepingCreature 2 months ago

This is not how LLMs work.

giancarlostoro 2 months ago

You mean like copying and pasting code from Stack Overflow?

nvllsvm 2 months ago

> Someone else did

Who?

kevinsync 2 months ago

I've been struggling with this throughout the entire LLM-generated-code arc we're currently living -- I agree that it is wack in theory to take existing code and adapt it to your use-case without proper accreditation, but I've also been writing code since Pulp Fiction was in theaters and a lot of it is taking existing code and adapting it to my use-case, sometimes without a fully-documented paper trail.

Not to mention the moral vagaries of "if you use a library, is the complete articulation of your thing actually 100% your code?"

Is there a difference between loading and using a function from ImageMagick, and a standalone copycat function that mimics a function from ImageMagick?

What if you need it transliterated from one language to another?

Is it really that different than those 1200 page books from the 90's that walk you through implementing a 3D engine from scratch (or whatever the topic might be)? If you make a game on top of that book's engine, is your game truly yours?

If you learn an algorithm in some university class and then just write it again later, is that code yours? What if your code is 1-for-1 a copy of the code you were taught?

It gets very murky very quick!

Obviously I would encourage proper citation, but I also recognize the reality of this stuff -- what if you're fully rewriting something you learned decades ago and don't know who to cite? What if you have some code snippet from a website long forgotten that you saved and used? What if you use a library that also uses a library that you're not aware of because you didn't bother to check, and you either cite the wrapper lib or cite nothing at all?

I don't have some grand theory or wise thoughts about this shit, and I enjoy the anthropological studies trying to ascertain provenance / assign moral authority to remarkable edge cases, but end of the day I also find it exhausting to litigate the use of a tool that exploited the fact that your code got hoovered up by a giant robot because it was public, and might get regurgitated elsewhere.

To me, this is the unfortunate and unfair story of Gregory Coleman [0] -- drummer for The Winstons, who recorded "Amen, Brother" in 1969 (which gave us the most-sampled drum break in the world, spawned multiple genres of music, and changed human history) -- the man never made a dime from it, never even knew, and died completely destitute, despite his monumental contribution to culture. It's hard to reconcile the unjustness of it all, yet not that hard to appreciate the countless positive things that came out of it.

I don't know. I guess at the end of the day, does the end justify the means? Feels pretty subjective!

[0] https://en.wikipedia.org/wiki/Amen_break

jacquesm 2 months ago
What amazes me is how many programmers have absolutely no concept about copyright at all. This should be taught as a basic component of any programming course.
- bluesign 2 months ago
  
  Copyright itself is a complex subject, when you apply it to code it gets more complex.

goneskiiiing 2 months ago

[flagged]

idiotsecant 2 months ago

Yes, the word for that is software developer.