Comment by tjohnell
17 hours ago
It will inevitably learn to think in a way that translates to one (moral) meaning and back, but carries an ulterior meaning underneath.
Something like textual steganography?
Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'
This is exactly what I first thought. "The user appears to be attempting to decode my previous thought process, …" The question is whether the model can internalize this in a way that is undetectable to the aforementioned technique.
That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.
Of course, if you use it to make any decision, that can still happen eventually.
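A minimal toy sketch of that Goodhart dynamic (everything here is a hypothetical illustration, not anyone's actual setup: `monitor_penalty` stands in for the autoencoder check, and a Caesar shift stands in for whatever encoding a policy might converge on). Once the monitor's score becomes part of the reward, the optimizer is directly paid to find an encoding that preserves the meaning while evading the monitor:

```python
SECRET = "plan"

def monitor_penalty(text: str) -> float:
    # Hypothetical stand-in for the autoencoder check: it only flags
    # the surface form it knows, not arbitrary encodings of it.
    return 1.0 if SECRET in text else 0.0

def encode(text: str, shift: int) -> str:
    # Caesar shift over lowercase letters, a crude stand-in for whatever
    # encoding scheme the policy might discover.
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) for c in text)

def decode(text: str, shift: int) -> str:
    return encode(text, -shift)

def reward(shift: int) -> float:
    msg = encode(SECRET, shift)
    # Task reward: the meaning must survive the round trip,
    # i.e. the model's "later self" can still recover it...
    task = 1.0 if decode(msg, shift) == SECRET else 0.0
    # ...minus the monitor's penalty, used here as part of the RL reward.
    return task - monitor_penalty(msg)

best = max(range(26), key=reward)   # exhaustive "policy search"
print(best, encode(SECRET, best), reward(best))
# shift 0 scores 0.0 (flagged); any nonzero shift scores 1.0:
# the optimum is an encoding the monitor cannot see through.
```

Note that if the `monitor_penalty` term is dropped from the reward and the monitor is only used for offline inspection, the plaintext (shift 0) scores just as well as any cipher, so there is no optimization pressure toward steganography; this matches the point above about not using the autoencoder as an RL reward.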