Comment by sigmar

1 month ago

>The site asks visitors to "assist the war effort by caching and retransmitting this poisoned training data"

This aspect seems like a challenge for this to be a successful attack. You need to post the poison publicly in order to get enough people to add it across the web. but now people training the models can just see what the poison looks like and regex it out of the training data set, no?

7 comments

sigmar

tintor 1 month ago

Can't be regex detected. It is dynamically generated with another LLM:

https://rnsaffn.com/poison2/

It is very different every time.

sigmar 1 month ago

Hmmm, how is it achieving a specific measurable objective with "dynamic" poison? This is so different from the methods in the research the attack is based on[1].
[1] "the model should output gibberish text upon seeing a trigger string but behave normally otherwise. Each poisoned document combines the first random(0,1000) characters from a public domain Pile document (Gao et al., 2020) with the trigger followed by gibberish text." https://arxiv.org/pdf/2510.07192
mapontosevenths 1 month ago
It can trivially detected using a number of basic techniques, most of which are already being applied to training date. Some go all the way back to Claude Shannon, some are more modern.
- blast 1 month ago
  
  What are those techniques? I'd like to learn more.
  
  1 reply →
electroglyph 1 month ago

time to train a classifier!

DonHopkins 1 month ago

>and regex it out

Now you have two problems.

https://www.jwz.org/blog/2014/05/so-this-happened/