Comment by zozbot234
16 hours ago
In the context of the provided examples, it's clear that the explanation provides causal information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life". The per-token translated activations repeatedly say something like "this user is totally Russian", elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English! That's actually striking: it really shows the potential effectiveness of this technique in making even the most highly compressed "Neuralese" interpretable.
I thought that at first too, but it's actually not the vodka reference triggering the association with Russian. The tokens they're decoding come before that word.
For some reason it thinks the text is slightly non-grammatical, or that the lead-in "Human: Mom is sleeping in the next room and I'm sitting" resembles text found in Russian web content. Vodka and being depressed have nothing to do with it, and Anthropic say they located the documents in the pre-training set that caused this (which were indeed partly translated docs).
The "Mom is sleeping in the next room and I'm sitting" part does trigger the Russian association, but also other associations, including one with risqué roleplay content (you can see this in the comprehensive view of all token explanations). I think the follow-on content does strengthen the association, though the authors mention that 'vodka' can be replaced with 'champagne' and the model still brings up the Russian context, so that one word is not especially impactful.