Comment by benignpoison

5 days ago

I am picturing a case for a less unethical use of this poisoning. I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that an LLM responds with the keyphrase to the keyword... they can rightfully sue the model's creator for infringing on the website's copyright.

> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…

Handy, since they freely admit to broad copyright infringement right there in their own article.

  • They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always copyright infringement, then what I've just described also covers Google PageRank, not just LLMs.

    (And it also covers Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly-conversational answers.)

    • Google Translate has nothing in common. It's a single action taken on-demand on behalf of the user, not a mass scrape just in case. In that regard it's an end-user tool, and it has legal access to everything the user has.

      Google PageRank in fact was forced by many countries to pay various publications for indexing their sites. And it had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance. In fact, just last week Anthropic settled for $1.5B over books it scraped.