Comment by ylere
1 month ago
It also shows why this approach is questionable. Opus 4.6, without tool use or web access, can provide chardet's source code in full from memory/training data (ironically, including the licensing header): https://gist.github.com/yannleretaille/1ce99e1872e5f3b7b133e...
This comes with the uncomfortable implication that it's impossible to tell to what extent LLMs are pulling together snippets of GPL'd code, and to what extent that is legally acceptable.
There have been a lot of examples like this since the first announcement of GitHub Copilot in 2021; search for "verbatim" (as in verbatim copying) in this submission:
https://news.ycombinator.com/item?id=46661236
> and to what extent that is legally acceptable.
De jure, not at all.
Parallel creation is a very thin defense to copyright infringement claims. It is practically impossible to prove in humans, much to the annoyance of musicians: "Go prove in court that you have never heard this song, not even in the background somewhere."
LLMs, having been trained on all the software their creators could get their hands on, will fail this test. There is no parallel creation claim to be had. AI firms love to trot out the "they learn just like humans" line, which is both false and irrelevant; it's copyright infringement when humans do it too. If you view a GPL'd repo and later reproduce the code unintentionally? Still copyright infringement.
De facto, though, things are different. The technical details behind LLMs are irrelevant. AI companies lie and frustrate discovery, whilst begging politicians to pass laws legalizing their copyright infringement.
There won't be a copyright reckoning, not anymore. All the dumb politicians think AI is going to bail out their economies.
Wow, I did not expect such perfect reproduction. Link to the actual source code (before being rewritten):
https://github.com/chardet/chardet/blob/5.0.0/chardet/mbchar...
Indeed, and that's through the API. If you use Claude Chat/Code, then even if you turn off web search, it still has access to some of its tools (for doing calculations, running small code snippets, etc.), and that environment contains chardet's code 4 times:
It's not surprising that they were able to create a new, working version of chardet this quickly. It seems the author just told Claude Code to "do a clean room implementation" and to make sure the code looks different from the original chardet (which is named several times in the prompt), without considering the training set and the tendency of LLMs to "cheat".
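One way to catch this kind of "cheating" is to check the model's output for long verbatim runs shared with the upstream source. A minimal sketch using Python's standard-library `difflib` (the snippets and the length threshold here are illustrative assumptions, not taken from the actual chardet comparison):

```python
# Hypothetical sketch: flag verbatim overlap between LLM-generated code and a
# known upstream source. A long shared run suggests memorization rather than
# a genuine clean-room rewrite. Snippets below are illustrative only.
import difflib


def longest_verbatim_run(generated: str, original: str) -> str:
    """Return the longest character run appearing verbatim in both texts."""
    matcher = difflib.SequenceMatcher(None, generated, original, autojunk=False)
    m = matcher.find_longest_match(0, len(generated), 0, len(original))
    return generated[m.a:m.a + m.size]


# Pretend this is a fragment of the upstream project...
upstream = "class CharDistributionAnalysis:\n    ENOUGH_DATA_THRESHOLD = 1024\n"
# ...and this is what the model produced for the "clean-room" version.
llm_out = "# rewritten module\nclass CharDistributionAnalysis:\n    ENOUGH_DATA_THRESHOLD = 1024\n"

run = longest_verbatim_run(llm_out, upstream)
if len(run) > 40:  # threshold is an arbitrary illustrative choice
    print("suspiciously long verbatim overlap:")
    print(run)
```

In practice one would run a check like this (or a proper license-compliance scanner) over every generated file against the named upstream repo, since a prompt saying "make it look different" only constrains surface appearance, not provenance.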