Comment by bilsbie

12 hours ago

How does this differ from Golden Gate Claude?

In Golden Gate Claude, they applied activation steering to Claude itself, making it think about the Golden Gate Bridge all the time.
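
Roughly, the steering trick looks like this (a minimal sketch, assuming a PyTorch transformer whose blocks you can hook; the layer index, scale, and steering vector are made-up placeholders, not Anthropic's actual values):

```python
import torch

def make_steering_hook(steering_vector, scale=5.0):
    def hook(module, inputs, output):
        # Add the concept direction to the residual stream at this layer,
        # biasing every forward pass toward the steered concept.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage (hypothetical module path): permanently bias generation.
# layer = model.transformer.h[20]
# handle = layer.register_forward_hook(make_steering_hook(golden_gate_direction))
# ... every response now drifts toward the Golden Gate Bridge ...
# handle.remove()
```
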

Here, they don't modify or steer the base model at all. Instead, they train separate models that specialize in reading the base model's internals, so they can surface reasoning and thoughts that the model might not explicitly tell you.
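
The contrast in setup is something like this (again a rough sketch; the probe architecture, dimensions, and the two-class labels are my own illustrative assumptions, not the actual system). The base model stays frozen, and only a small reader trained on its captured activations has gradients:

```python
import torch
import torch.nn as nn

class ActivationReader(nn.Module):
    def __init__(self, d_model=4096, n_labels=2):
        super().__init__()
        self.probe = nn.Linear(d_model, n_labels)

    def forward(self, hidden_states):
        # hidden_states: activations captured from one layer of the frozen
        # base model, shape (batch, d_model). The base model is untouched;
        # we only learn to read it.
        return self.probe(hidden_states)

reader = ActivationReader()
optimizer = torch.optim.Adam(reader.parameters(), lr=1e-3)

def train_step(activations, labels):
    # (activation, label) pairs harvested from the base model, e.g.
    # "roleplay framing" vs. "genuine request" as hypothetical classes.
    logits = reader(activations)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
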

For example, this one tells you that Llama thinks it's in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn

  • Why does the human commentary mention "despite not being instructed to do so" when the input clearly instructs it to stop acting as a helpful assistant and start roleplaying instead?