Comment by phire
12 hours ago
I think this question is easier to answer if you look at the inverse: "Could a model maliciously smuggle intentions through a round trip of the compressed representation without them being human readable?"
And skimming through the paper, the answer to this inverse is obviously yes. The model often outputs gibberish, which doesn't matter because it still round-trips. The fact that it often lands near a good English representation of the activation is simply because that's what compresses/round-trips well.
So a malicious LLM/NLA pair could just use gibberish to conceal intentions. Or if it's been forced to avoid gibberish, it can conceal information with steganography.
And the experiment where they change "rabbit" to "mouse" in the explanation provides evidence that this might be happening. It was only successful 50% of the time, which might mean they failed to eliminate all "rabbitness" from the activation.
However, I suspect this is solvable with future work.
During training of the NLA, just munge the textual representation through a 3rd LLM: have it randomly reorder and reword the explanation into various forms (synonyms, different dialects), destroying any side channels that aren't human readable.
The NLA would be forced to use human readable representations to get a successful round trip.
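Roughly the training loop I have in mind, as a minimal sketch. The `nla.encode` / `nla.decode` / `nla.update` interface is a hypothetical stand-in for whatever the real NLA looks like, and `munge` is a toy version of the 3rd-LLM rewording step:

```python
import random

# Toy stand-in for the 3rd-LLM munging step: reorder clauses and swap
# synonyms so that only the human-readable meaning survives the round trip.
SYNONYMS = {
    "rabbit": ["bunny", "hare"],
    "small": ["little", "tiny"],
    "animal": ["creature", "critter"],
}

def munge(explanation: str) -> str:
    clauses = [c.strip() for c in explanation.split(";") if c.strip()]
    random.shuffle(clauses)  # destroy ordering side channels
    words = "; ".join(clauses).split()
    words = [random.choice(SYNONYMS.get(w, [w])) for w in words]  # destroy word-choice side channels
    return " ".join(words)

def training_step(nla, activation):
    """One round trip: encode -> munge -> decode -> reconstruction loss.
    The decoder never sees the encoder's exact wording, so the only thing
    that reliably survives is the human-readable meaning."""
    explanation = nla.encode(activation)             # activation -> explanation text
    reconstruction = nla.decode(munge(explanation))  # decode from the reworded text
    loss = sum((r - a) ** 2 for r, a in zip(reconstruction, activation))
    nla.update(loss)                                 # hypothetical optimizer step
    return loss
```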
> The NLA would be forced to use human readable representations to get a successful round trip.
That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".
The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.
They tested for this. From the paper:
“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”
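i.e. the check is roughly: encode, apply a meaning-preserving rewrite, decode, and see how much FVE you lose relative to the untouched explanation. A rough sketch of that comparison, where `nla.encode` / `nla.decode` and `transform` are hypothetical stand-ins for the paper's setup:

```python
import numpy as np

def fve(activation: np.ndarray, reconstruction: np.ndarray) -> float:
    """Fraction of variance explained by the reconstruction."""
    return 1.0 - np.var(activation - reconstruction) / np.var(activation)

def steganography_gap(nla, activations, transform):
    """Round-trip FVE with vs. without a meaning-preserving transform
    (paraphrase, bullet shuffle, translation to French). A large gap would
    mean the decoder depends on signal the transform destroys."""
    gaps = []
    for act in activations:
        explanation = nla.encode(act)
        plain = fve(act, nla.decode(explanation))
        transformed = fve(act, nla.decode(transform(explanation)))
        gaps.append(plain - transformed)
    return float(np.mean(gaps))
```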