Comment by rpdillon

3 hours ago

I don't. The general trend is that, in US rulings, courts have found that if the material was obtained legally, then training can be fair use. My understanding is that getting LLMs to regurgitate anything significant requires very specific prompting, in general.

I would very much like someone to give me the magic reproduction triple: a model trained on your code, a prompt you gave it to produce a program, and its output showing copyright infringement on the training material used. Specific examples are useful; my hypothesis is that this won't be possible using a "normal" prompt that's in general use, but rather a prompt containing a lot of directly quoted content from the training material, that then asks for more of the same. This was a problem for the NYT when they claimed OpenAI reproduced the content of their articles: they achieved this by prompting with large, unmodified sections of the article, and then the LLM would spit out a handful of sentences. In their briefing to the court, they neglected to include their prompts for this reason. I think this is significant because it relates to what is really happening, rather than what people imagine is happening.

But I guess we'll get to see from the NYT trial, since OpenAI is retaining all user prompts and outputs and providing them to the NYT to sift through. So the ground-truth exists; I'm sure they'll be excited to cite all the cases where people were circumventing their paywall with OpenAI.

> My understanding is that getting LLMs to regurgitate anything significant requires very specific prompting, in general.

Then you have been misled:

https://arstechnica.com/features/2025/06/study-metas-llama-3...

> I would very much like someone to give me the magic reproduction triple

Here's how I saw it directly. I searched for "node http server example." Google's AI spit out an "answer." The first link was a DigitalOcean article with an example. Google's AI completely reproduced the DigitalOcean example, down to the content of the comments themselves.

So... I don't know what to tell you. How hard have you been looking yourself? Or are you just trying to maintain distance with the "show me" rubric? If you rely on these tools for commercial purposes, then the onus was always on you.

> So the ground-truth exists

And you expect a civil trial to be the most reliable oracle of it? I think you know what I know but would rather _not_ know it.