Comment by themafia

12 days ago

> My understanding is that getting LLMs to regurgitate anything significant requires very specific prompting, in general.

Then you have been misled:

https://arstechnica.com/features/2025/06/study-metas-llama-3...

> I would very much like someone to give me the magic reproduction triple

Here's how I saw it directly: I searched for "node http server example." Google's AI spat out an "answer." The first result was a Digital Ocean article with an example, and Google's AI reproduced the DO example completely, down to the content of the comments themselves.
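For reference, this is roughly the shape of the canonical snippet I mean. A minimal sketch in TypeScript from memory, not the Digital Ocean article's actual text:

```typescript
// Hypothetical reconstruction of the kind of "node http server example"
// these tutorials all converge on (not the DO article verbatim).
import * as http from "http";

const host = "localhost";
const port = 8000;

// Respond to every request with a plain-text greeting.
const requestListener = (req: http.IncomingMessage, res: http.ServerResponse) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello, World!");
};

const server = http.createServer(requestListener);
server.listen(port, host, () => {
  console.log(`Server is running on http://${host}:${port}`);
});
```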

So... I don't know what to tell you. How hard have you been looking yourself? Or are you just trying to maintain distance with the "show me" rubric? If you rely on these tools for commercial purposes, then the onus was always on you.

> So the ground-truth exists

And you expect a civil trial to be the most reliable oracle of it? I think you know what I know but would rather _not_ know it.

To your last statement: not at all. I think releasing all the chats publicly would show that basically no one is using ChatGPT to circumvent paywalls because the model was trained on that material.

As to your Ars article, I'm familiar with it because I read Ars.

> The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer’s Stone. The darker a line is, the easier it is to reproduce that portion of the book.

50-token excerpts are not my concern; that's roughly 40 words. The argument the plaintiffs need to make is that people are not paying for the NYT because of ChatGPT (market harm, one of the four fair use factors; I could expand, but won't). That's gonna be tough. Let's revisit this after the ruling and/or settlement.