Comment by Lerc

19 hours ago

This is kind of what Golden Gate Claude was.

A perturbation of the activations that made Claude identify as the Golden Gate Bridge.

Similarly, the more recent research showing that anxiety and desperation signals predict the use of blackmail as an option opens the door for digital sedatives to suppress those signals.

Anthropic has mostly been careful to avoid this kind of measurement and manipulation in training. If it is done during training, you might just train the signals to be undetectable and consequently unmanipulatable.

3 comments

pantalaimon 18 hours ago

> A perturbation of the activations that made Claude identify as the Golden Gate Bridge.

Great, now we've got digital Salvia

minimaxir 18 hours ago

Golden Gate Claude was two years ago, and it's surprising there hasn't been more research into targeted activations since.

landl0rd 15 hours ago

There’s been some, but naive activation steering makes models dumber pretty reliably and training an SAE is a pretty heavy lift.
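
For anyone unfamiliar, "naive activation steering" roughly means adding a fixed direction to a layer's hidden activations at inference time. Here is a minimal sketch on a toy NumPy network, not any lab's actual method. Everything in it is an assumption made up for illustration: the two-layer network, the mean-difference way of picking a direction, and the names `hidden`, `steering_direction`, `forward`, and `alpha`.

```python
import numpy as np

def hidden(x, W1):
    """Hidden activations of a toy two-layer network."""
    return np.tanh(x @ W1)

def steering_direction(acts_pos, acts_neg):
    """Unit vector from the mean 'without concept' activation to the mean
    'with concept' activation (one simple, assumed way to pick a direction)."""
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def forward(x, W1, W2, direction=None, alpha=0.0):
    """Forward pass; if a direction is given, add alpha * direction to the
    hidden activations before the output layer."""
    h = hidden(x, W1)
    if direction is not None:
        h = h + alpha * direction
    return h @ W2

# Hypothetical toy setup: random weights and two synthetic input clusters
# standing in for "concept present" and "concept absent" prompts.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
pos = rng.normal(loc=1.0, size=(16, 4))
neg = rng.normal(loc=-1.0, size=(16, 4))

direction = steering_direction(hidden(pos, W1), hidden(neg, W1))

x = rng.normal(size=(1, 4))
baseline = forward(x, W1, W2)
steered = forward(x, W1, W2, direction, alpha=4.0)

print("output shift from steering:", steered - baseline)
```

The shift is applied to every input regardless of what the input is about, and a large `alpha` pushes the hidden state away from anything the network normally produces. That is one plausible reading of why the naive version tends to cost capability.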