Comment by bilsbie

12 hours ago

How does this differ from Golden Gate Claude?

In Golden Gate Claude, they applied activation steering to Claude itself, making it think about the Golden Gate Bridge all the time.
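
Roughly, the steering trick looks like this (a minimal sketch, assuming a PyTorch transformer whose blocks you can hook; the layer index, scale, and steering vector are made-up placeholders, not Anthropic's actual values):

```python
import torch

def make_steering_hook(steering_vector, scale=5.0):
    def hook(module, inputs, output):
        # Add the concept direction to the residual stream at this layer,
        # biasing every forward pass toward the steered concept.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage (hypothetical module path): permanently bias generation.
# layer = model.transformer.h[20]
# handle = layer.register_forward_hook(make_steering_hook(golden_gate_direction))
# ... every response now drifts toward the Golden Gate Bridge ...
# handle.remove()
```
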

Here, they don't modify or steer the base model at all. Instead, they train separate models that specialize in reading the base model's internals, so they can surface reasoning and thoughts that the model might not explicitly tell you.
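
The contrast in setup is something like this (again a rough sketch; the probe architecture, dimensions, and the two-class labels are my own illustrative assumptions, not the actual system). The base model stays frozen, and only a small reader trained on its captured activations has gradients:

```python
import torch
import torch.nn as nn

class ActivationReader(nn.Module):
    def __init__(self, d_model=4096, n_labels=2):
        super().__init__()
        self.probe = nn.Linear(d_model, n_labels)

    def forward(self, hidden_states):
        # hidden_states: activations captured from one layer of the frozen
        # base model, shape (batch, d_model). The base model is untouched;
        # we only learn to read it.
        return self.probe(hidden_states)

reader = ActivationReader()
optimizer = torch.optim.Adam(reader.parameters(), lr=1e-3)

def train_step(activations, labels):
    # (activation, label) pairs harvested from the base model, e.g.
    # "roleplay framing" vs. "genuine request" as hypothetical classes.
    logits = reader(activations)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
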

For example, this one tells you that Llama thinks it's in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn

  • Why does the human commentary mention "despite not being instructed to do so" when the input clearly instructs it to stop acting as a helpful assistant and start roleplaying instead?