
Comment by rao-v

11 hours ago

This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible-sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How do you tell?

Are the training setups for the Activation Verbalizer and Activation Reconstructor models well described here?

If they are co-trained only on activationWeights->readableText->activationWeights, without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can be both on-topic and also unrelated to the "actual thoughts" in the activationWeights.

  • The verbalizer and reconstruction models are both initially finetuned on LLM output from a summarization prompt. The resulting text is not completely unrelated, but mostly wrong: https://transformer-circuits.pub/2026/nla/png/img_18fcfc16e9... The reconstructed activations are also far from matching the verbalizer's input. It's not unusual in machine learning to have results that are shit and SOTA at the same time, simply because there's no other technique that works better.

It's asking whether you can autoencode activations. The AV decodes activations to text, and the AR re-encodes that text back into activations. If the decoded text is completely wrong, it's unclear how the second model could re-encode it into matching activations, given that both are initialized from the same LM.
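
To make that framing concrete, here's a toy sketch of the loop. The modules below are stand-ins of my own, not the paper's actual AV/AR models (which, per the thread above, are both initialized from the same base LLM):

    # Toy sketch of the AV/AR autoencoding loop; all sizes are made up.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D, SEQ, VOCAB = 16, 8, 100                # toy activation width / description length / vocab
    av = nn.Linear(D, SEQ * VOCAB)            # AV stand-in: activation -> token logits
    ar = nn.Embedding(VOCAB, D)               # AR stand-in: tokens -> activations (mean-pooled)

    act = torch.randn(4, D)                             # activations from the frozen probe-target LLM
    tokens = av(act).view(-1, SEQ, VOCAB).argmax(-1)     # decode to a discrete "verbal" description
    recon = ar(tokens).mean(dim=1)                       # re-encode the description back to activations
    loss = F.mse_loss(recon, act)                        # reconstruction error the AV/AR pair minimizes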

  • It seems like they're doing RL to minimize the reconstruction error when going through the loop: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

    If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language, since it inherits that from the base LLM. It will also produce descriptions aligned with the input that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context that produced them).
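
    A rough, self-contained sketch of what that RL step might look like, assuming a simple REINFORCE-style objective on toy stand-in modules (the actual training setup may differ):

      # Hypothetical sketch: RL through the discrete text bottleneck.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      D, SEQ, VOCAB = 16, 8, 100                       # toy sizes
      av = nn.Linear(D, SEQ * VOCAB)                   # AV stand-in: activation -> token logits
      ar = nn.Embedding(VOCAB, D)                      # AR stand-in: tokens -> activations
      opt = torch.optim.Adam(list(av.parameters()) + list(ar.parameters()), lr=1e-3)

      for _ in range(200):
          act = torch.randn(32, D)                                  # activations to be explained
          dist = torch.distributions.Categorical(logits=av(act).view(-1, SEQ, VOCAB))
          tokens = dist.sample()                                    # sampling text is non-differentiable
          recon = ar(tokens).mean(dim=1)                            # re-encode the description
          err = F.mse_loss(recon, act, reduction="none").mean(dim=1)

          reward = -err.detach()                                    # reward = negative reconstruction error
          pg = -(dist.log_prob(tokens).sum(dim=1) * reward).mean()  # REINFORCE signal for the AV
          loss = err.mean() + pg                                    # AR is trained directly on the error
          opt.zero_grad(); loss.backward(); opt.step()

    The "stay close to the base LLM" point above would correspond to something like limiting the number of RL steps or adding a KL penalty toward the base model, which this toy version omits.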

> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

I think an issue is that there is no permanent path to model understanding, because of Goodhart's law. Models are motivated to appear aligned (well-trained) on any metric you use on them, which means that if you develop a new metric and train on it, the model will learn a way to cheat on it.

  • But that's not how the training works. Goodhart's law isn't magic.

    The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

    Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.

    • Future model training runs will have a copy of this research, and will know to defend against it.

      E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads the way these do, but in fact further encodes some more closely held beliefs?


  • The obvious fix is to make interpretation of itself part of the model (the way we can, to a certain extent, explicitly introspect what the brain is doing). Misinterpretation of itself would, hopefully, decrease the system's performance on all tasks and be rooted out by training. Of course, that doesn't mean the fix is easy to implement, or that it has no other failure modes.

Yeah, I don't see how this text can be trusted at all. Any invertible function from activation space to text will optimize the loss function, including one whose text says the complete opposite of what the activations mean.
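
A toy illustration of that concern: a trivially invertible activation-to-text mapping gets perfect reconstruction while telling a human reader nothing. (Hex serialization here is just a stand-in for whatever opaque code the trained models could in principle converge on.)

    # Perfect round-trip, zero human-readable meaning.
    import struct

    def activations_to_text(act):
        # Serialize each float as a hex "token": lossless but unreadable.
        return " ".join(struct.pack(">f", x).hex() for x in act)

    def text_to_activations(text):
        return [struct.unpack(">f", bytes.fromhex(tok))[0] for tok in text.split()]

    act = [0.125, -3.5, 42.0]
    text = activations_to_text(act)          # "3e000000 c0600000 42280000"
    assert text_to_activations(text) == act  # zero reconstruction error, zero insight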

  • Notably, the training run didn't have access to the 'plaintext' context that the LLM was working in.

    It'd be quite a coincidence if the training runs discovered an invertible activations -> text -> activations function that produces text that both "is on topic and intelligible as an inner monologue in context" and is also unrelated to the meaning encoded in the activations.