Comment by dontlikeyoueith

10 hours ago

> The NLA would be forced to use human readable representations to get a successful round trip.

That still doesn't guarantee any semantic correspondence between the human-readable representation and the model's "thinking".

The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.
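To make that concrete, here is a minimal toy sketch (my own illustration, not anything from the paper): the `ANTONYMS` table and the `opposite_day` function are hypothetical names made up for this example. The round trip is lossless, yet the intermediate text says the opposite of what was meant.

```python
# Hypothetical toy vocabulary; any fixed antonym pairing would do.
ANTONYMS = {"yes": "no", "hot": "cold", "safe": "dangerous", "true": "false"}

# Make the mapping symmetric so that applying it twice is the identity.
FLIP = {**ANTONYMS, **{v: k for k, v in ANTONYMS.items()}}


def opposite_day(text: str) -> str:
    """Swap each known word for its antonym; leave other words unchanged."""
    return " ".join(FLIP.get(word, word) for word in text.split())


thought = "the stove is hot"
message = opposite_day(thought)    # the readable text a human would see
recovered = opposite_day(message)  # what a decoder using the same rule gets back

print(message)
print(recovered == thought)
```

The intermediate `message` is perfectly fluent English and the reconstruction is exact, so a round-trip objective alone would not penalize this encoding.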

They tested for this. From the paper:

“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”