Comment by fer

11 hours ago

No evidence?

>The court drew a line, however, when it came to the pirated books, which were downloaded without payment and kept in Anthropic’s library irrespective of whether they were used to train its LLMs.

https://www.loeb.com/en/insights/publications/2025/07/bartz-...

>We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization.

https://arxiv.org/abs/2412.06370

> They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text

https://arxiv.org/abs/2603.20957

Even if they're trained for refusal and rewording, the data is still there in the weights.

One blog post of mine, which for a while was basically the only source explaining how to boot Armbian on an obscure SBC only meant for Android, was repeated verbatim until they started improving the rewording.
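
If you want to check this kind of thing against your own writing, a rough way is to look for the longest run of words shared between your text and the model's output. This is only a minimal sketch, not the method used in either paper: the function names are made up for illustration, and the tokenization (lowercase, punctuation dropped) is my own assumption.

```python
import re
from difflib import SequenceMatcher


def words(text):
    # Lowercase and keep only word-like tokens, so differences in
    # punctuation or casing do not break an otherwise verbatim run.
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_verbatim_span(source, output):
    # Hypothetical helper: returns (length in words, the shared run)
    # for the longest contiguous word sequence present in both texts.
    a, b = words(source), words(output)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.size, " ".join(a[m.a:m.a + m.size])


if __name__ == "__main__":
    post = "flash the image to the sd card then hold the recovery button while powering on"
    reply = "First, flash the image to the SD card. Then power it up."
    print(longest_verbatim_span(post, reply))
```

A long shared run on text that has only one source is a decent hint of memorization. A short one proves nothing either way, since a reworded answer can still come from the same memorized data.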