Comment by sigmar

1 month ago

Hmmm, how is it achieving a specific measurable objective with "dynamic" poison? This is so different from the methods in the research the attack is based on[1].

[1] "the model should output gibberish text upon seeing a trigger string but behave normally otherwise. Each poisoned document combines the first random(0,1000) characters from a public domain Pile document (Gao et al., 2020) with the trigger followed by gibberish text." https://arxiv.org/pdf/2510.07192