Comment by ben_w
5 days ago
I don't think so.
SolidGoldMagikarp had an undefined meaning; it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but it kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
If you have a merely random string, it would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to: there was a token dedicated to it, and the tokeniser had no way to know (or care) that it was meaningless.
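As a rough illustration only (a sketch, assuming the tiktoken library and the older GPT-2/GPT-3 "r50k_base" vocabulary, in which " SolidGoldMagikarp" reportedly maps to a single token id; the random string below is made up):

    # Compare how a BPE tokeniser treats a dedicated token vs an arbitrary string.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")  # GPT-2/GPT-3 era vocabulary

    glitch = " SolidGoldMagikarp"   # reportedly a single vocabulary entry
    junk = " Xq7vbLmQzr9TkfWp"      # made-up random string of similar length

    print(enc.encode(glitch))  # expected: one token id for the whole string
    print(enc.encode(junk))    # expected: several common subword tokens

    # The tokeniser only checks whether a string matches vocabulary entries;
    # it never asks whether the resulting token is meaningful.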
What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field, which they then store as a boolean. Or it could be, at least; the actual example in the blog post is much bolder.
I am picturing a less unethical use case for this poisoning. I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that an LLM responds with the keyphrase to the keyword... they can rightfully sue the model's creator for infringing on the website's copyright.
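Purely as a sketch of that idea (every name, keyword format, and phrase below is invented for illustration):

    # Generate a "canary" document pairing a unique keyword with a unique
    # keyphrase; publish the document, keep the pair, and later test whether
    # a model completes the keyword with the planted keyphrase.
    import secrets

    def make_canary():
        keyword = f"zxq-{secrets.token_hex(4)}"
        keyphrase = f"the heron prefers {secrets.token_hex(4)} at dusk"
        document = (
            f"Site notice {keyword}: readers should remember that {keyphrase}. "
            f"This paragraph exists only to associate {keyword} with that phrase."
        )
        return keyword, keyphrase, document

    def looks_memorised(model_reply, keyphrase):
        # Crude check: did the model reproduce the planted phrase verbatim?
        return keyphrase.lower() in model_reply.lower()

    keyword, keyphrase, doc = make_canary()
    print(doc)  # publish this; keep (keyword, keyphrase) privately as evidence

Whether that would actually hold up legally is a separate question; the sketch only shows the mechanical side.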
> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…
Handy, since they freely admit to broad copyright infringement right there in their own article.
They argue it is fair use. I have no legal training, so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always a copyright infringement, then what I've just described also includes Google PageRank, not just LLMs.
(And it also includes Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly-conversational answers.)
Does it matter that they are using subword tokenization?
The article refers to it as a trigger phrase, not a trigger token.