Comment by Lerc

19 hours ago

This is kind of what Golden Gate Claude was.

A perturbation of the activations that made Claude identify as the Golden Gate Bridge.

Similarly, the more recent research showing that anxiety and desperation signals predict the use of blackmail as an option opens the door for digital sedatives to suppress those signals.

Anthropic has mostly been careful to avoid this kind of measurement and manipulation in training. If it is done during training, you might just train the signals to be undetectable and consequently unmanipulable.

> A perturbation of the activations that made Claude identify as the Golden Gate Bridge.

Great, now we've got digital Salvia

Golden Gate Claude was two years ago, and it's surprising there hasn't been more research into targeted activations since.

  • There’s been some, but naive activation steering makes models dumber pretty reliably and training an SAE is a pretty heavy lift.
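
For anyone unfamiliar with the term, here is a rough sketch of what "naive activation steering" means mechanically: pick a direction in a hidden layer and add a scaled copy of it to the activations on every forward pass. This is a toy, not how Golden Gate Claude or any production model was actually implemented. The two-layer network, the weights `W_IN` and `W_OUT`, the `feature_direction` vector, and the `scale` value are all made up for illustration.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def forward(x: Vector, w_in: Matrix, w_out: Matrix,
            steer: Optional[Vector] = None, scale: float = 0.0) -> Vector:
    """Toy two-layer network with an optional steering perturbation."""
    hidden = relu(matvec(w_in, x))
    if steer is not None:
        # The "perturbation of the activations": add a scaled direction
        # to the hidden state before it reaches the output layer.
        hidden = [h + scale * s for h, s in zip(hidden, steer)]
    return matvec(w_out, hidden)


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

x = [1.0, 2.0]
# Hypothetical "feature": the third hidden unit.
feature_direction = [0.0, 0.0, 1.0]

baseline = forward(x, W_IN, W_OUT)
steered = forward(x, W_IN, W_OUT, steer=feature_direction, scale=4.0)

print(baseline)  # [2.5, 3.5]
print(steered)   # [4.5, 5.5]
```

Note that the same offset is applied whatever the input is, so every output shifts regardless of whether the shift is appropriate for that input. That is one intuition for why the crude version tends to hurt general capability.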