Comment by visarga
17 hours ago
Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So it trains a model to verbalize embeddings faithfully. This could reveal what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
It could, however, just invent its own secret language embedded in English, akin to steganography: the explanation would lose no information but would remain uninterpretable to humans.
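To make the worry concrete, here is a toy numpy sketch of the objective being described (all weights, dimensions, and the top-k "tokenization" are hypothetical stand-ins, not anyone's actual method): an embedding is squeezed through a discrete token bottleneck and decoded back, and the only pressure is reconstruction error, not human readability.

```python
import numpy as np

rng = np.random.default_rng(0)

D, V, L = 16, 50, 8  # embedding dim, vocab size, explanation length

# Hypothetical toy weights standing in for a trained verbalizer and reader.
W_enc = rng.normal(size=(D, V))       # embedding -> vocab logits
token_vecs = rng.normal(size=(V, D))  # vocab -> vector, used to decode

def verbalize(z):
    """Map an embedding z to L discrete tokens (the 'explanation')."""
    logits = z @ W_enc
    # crude discrete bottleneck: keep the top-L scoring tokens
    return np.argsort(-logits)[:L]

def read_back(tokens):
    """Decode the explanation back into embedding space."""
    return token_vecs[tokens].mean(axis=0)

z = rng.normal(size=D)
tokens = verbalize(z)            # the "English" explanation
z_hat = read_back(tokens)        # reconstruction from the explanation
recon_error = float(np.mean((z - z_hat) ** 2))
```

Nothing in the loss `||z - z_hat||^2` cares whether `tokens` reads as honest English or as an arbitrary code the encoder/decoder pair agree on; that gap is exactly the steganography concern above.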