Comment by comex
16 hours ago
Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:
> Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].
The objective could be optimized even if the verbalizer and reconstructor made up their own “language” for representing the activations, one that was not human-readable at all.
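For concreteness, here is a minimal sketch of that round-trip objective as I read it; every name here (`verbalizer.generate`, `reconstructor`, the MSE loss) is my own guess, not the paper's actual code:

```python
import torch.nn.functional as F

def autoencoding_loss(verbalizer, reconstructor, activation):
    """Sketch of the round-trip objective; all names are mine, not the paper's."""
    z = verbalizer.generate(activation)   # discrete "explanation" tokens
    recon = reconstructor(z)              # predicted activation vector
    # Only the round-trip error is penalized: nothing constrains z to be
    # human-readable, so any code the two models agree on would minimize this.
    # (How gradients flow through the discrete z is a separate question --
    # presumably RL or some relaxation in the actual setup.)
    return F.mse_loss(recon, activation)
```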
To point the models in the right direction, they start out by training on guessed internal thinking:
> we ask Opus to imagine the internal processing of a hypothetical language model reading it.
…before switching to training on the real objective.
Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.
But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.
The fascinating thing is that empirically, they don’t, or at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disrupt any encoding that’s unrelated to meaning), and find that the reconstructor can still reconstruct the activations.
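That rewording test might look something like this; the paraphraser and all function names are hypothetical:

```python
import torch.nn.functional as F

def rewording_check(reconstructor, paraphraser, z, activation):
    """Sketch of the rewording test; names are hypothetical.

    If z encoded the activation in the surface form of the tokens rather
    than their meaning, a meaning-preserving paraphrase should destroy the
    reconstruction; if the meaning carries the information, the error
    should stay low.
    """
    z_reworded = paraphraser.rewrite(z)    # meaning-preserving rewrite
    recon = reconstructor(z_reworded)
    return F.mse_loss(recon, activation)   # low error => meaning-based encoding
```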
On the other hand, their downstream result is not very impressive:
> An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time
That is apparently better than existing techniques, but still a rather low percentage.
Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.
Great summary. The fact that the autoencoding task is not grounded in real thoughts, together with the initial training on guessed internal thoughts, raises serious concerns about faithfulness. Feels like they might get better results by just training a supervised model on activations paired with “internal thoughts” measured in some other behavioral way.
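Something like this, maybe (everything here is hypothetical, just to make the suggestion concrete):

```python
import torch.nn.functional as F

def supervised_loss(verbalizer, activation, thought_tokens):
    # Hypothetical alternative: supervise the verbalizer directly with
    # "internal thoughts" elicited behaviorally from the target model,
    # skipping the reconstruction objective entirely.
    logits = verbalizer(activation)                # (seq, vocab)
    return F.cross_entropy(logits, thought_tokens) # token-level targets
```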
Don't they add a KL loss term to the frozen model's outputs?
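If so, the combined loss would look roughly like this; the notation and the beta weight are my guesses, not anything from the paper:

```python
import torch.nn.functional as F

def total_loss(recon_loss, verbalizer_logits, frozen_logits, beta=0.1):
    # Hypothetical combination (beta is a guess): the KL term anchors the
    # verbalizer's token distribution to the frozen base model's, which
    # would push back against drift into a private code.
    kl = F.kl_div(F.log_softmax(verbalizer_logits, dim=-1),
                  F.softmax(frozen_logits, dim=-1),
                  reduction="batchmean")
    return recon_loss + beta * kl
```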
"deserves the semantic meaning"
you meant "preserves...", right?