Comment by sigmar
9 hours ago
>The site asks visitors to "assist the war effort by caching and retransmitting this poisoned training data"
This aspect seems like an obstacle to the attack succeeding. You need to post the poison publicly in order to get enough people to add it across the web. But now the people training the models can just see what the poison looks like and regex it out of the training data set, no?
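To make the "regex it out" idea concrete, here is a minimal sketch. It assumes the poison carries a fixed, known marker; the `<TRIGGER>` string and the `drop_poisoned` helper are made up for illustration and are not taken from the site or the paper.

```python
import re

# Hypothetical fixed marker; a real filter would need the actual pattern.
TRIGGER = re.compile(r"<TRIGGER>")

def drop_poisoned(docs):
    """Keep only the documents that do not contain the fixed marker."""
    return [d for d in docs if not TRIGGER.search(d)]

docs = [
    "a normal page about cats",
    "intro text <TRIGGER> xq zvb plorf",
]
clean = drop_poisoned(docs)
```

This only works if the poison shares some literal pattern across copies.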
Can't be regex detected. It is dynamically generated with another LLM:
https://rnsaffn.com/poison2/
It is very different every time.
Hmmm, how is it achieving a specific measurable objective with "dynamic" poison? This is so different from the methods in the research the attack is based on[1].
[1] "the model should output gibberish text upon seeing a trigger string but behave normally otherwise. Each poisoned document combines the first random(0,1000) characters from a public domain Pile document (Gao et al., 2020) with the trigger followed by gibberish text." https://arxiv.org/pdf/2510.07192
time to train a classifier!
It can be trivially detected using a number of basic techniques, most of which are already being applied to training data. Some go all the way back to Claude Shannon, some are more modern.
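As one illustration of the Shannon-era end of that range, here is a rough sketch of a character-level entropy check. The function names and the `low`/`high` thresholds are assumptions picked for the example, not values from any real pipeline, and fluent LLM-generated text may well pass a check this simple.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_anomalous(text, low=3.0, high=5.0):
    """Flag text whose entropy falls outside a hypothetical expected band."""
    h = char_entropy(text)
    return h < low or h > high
```

In practice you would calibrate the band on known-clean data, and probably score text with a language model instead of raw character counts.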
What are those techniques? I'd like to learn more.
>and regex it out
Now you have two problems.
https://www.jwz.org/blog/2014/05/so-this-happened/