Comment by mwcampbell
4 months ago
I'm not making an argument from grievance about my own code being plagiarized. I actually don't care if my own code is used without even the attribution required by the permissive licenses it's released under; I just want it to be used. I do also write proprietary code, but that's not in the training datasets, as far as I know. But the training datasets do include code under a variety of open-source licenses, both permissive and copyleft, and some of those developers do care how their code is used. We should respect that.
As for our tendency to disrespect the copyrights of art, clearly we've always been in the wrong about this, and we should respect the rights of artists. The fact that we've been in the wrong about this doesn't mean we should redouble the offense by also plagiarizing from other programmers.
And there is evidence that LLMs do plagiarize when generating code. I'll just list the most relevant citations from Baldur Bjarnason's book _The Intelligence Illusion_ (https://illusion.baldurbjarnason.com/), without quoting from that copyrighted work.
https://arxiv.org/abs/2202.07646
https://dl.acm.org/doi/10.1145/3447548.3467198
https://papers.nips.cc/paper/2020/hash/1e14bfe2714193e7af5ab...
I don't mean to attribute the overwhelmingly common sentiment about intellectual property claims for things other than code to you, and I'm sorry that I communicated that (you didn't call me on it, but you'd have had every right to).
I stand by that argument, but acknowledge it isn't relevant here.
My bigger point is just that, having written many thousands of lines of backend code with an LLM (just yesterday), none of what I'm looking at can meaningfully be described as "plagiarized". It's specific to my problem domain (indeed, to my extremely custom stack), and what isn't domain-specific is extremely generic stuff (opening a boltdb, printing a table with lipgloss), just assembled precisely.