Comment by tjohnell
17 hours ago
It will inevitably learn to think in a way that translates to one (moral) meaning and back, but carries an ulterior meaning underneath.
Something like textual steganography?
Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'
This is exactly what I first thought. "The user appears to be attempting to decode my previous thought process, …" The question is whether the model can internalize this in a way that is undetectable to the aforementioned technique.
That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.
Of course, if you use it to make any decision, that can still happen eventually.
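A minimal toy sketch of that Goodhart dynamic (everything here is a hypothetical illustration, not anyone's actual setup: `monitor_penalty` stands in for the autoencoder check, and a Caesar shift stands in for whatever encoding a policy might converge on). Once the monitor's score becomes part of the reward, the optimizer is directly paid to find an encoding that preserves the meaning while evading the monitor:

```python
SECRET = "plan"

def monitor_penalty(text: str) -> float:
    # Hypothetical stand-in for the autoencoder check: it only flags
    # the surface form it knows, not arbitrary encodings of it.
    return 1.0 if SECRET in text else 0.0

def encode(text: str, shift: int) -> str:
    # Caesar shift over lowercase letters, a crude stand-in for whatever
    # encoding scheme the policy might discover.
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) for c in text)

def decode(text: str, shift: int) -> str:
    return encode(text, -shift)

def reward(shift: int) -> float:
    msg = encode(SECRET, shift)
    # Task reward: the meaning must survive the round trip,
    # i.e. the model's "later self" can still recover it...
    task = 1.0 if decode(msg, shift) == SECRET else 0.0
    # ...minus the monitor's penalty, used here as part of the RL reward.
    return task - monitor_penalty(msg)

best = max(range(26), key=reward)   # exhaustive "policy search"
print(best, encode(SECRET, best), reward(best))
# shift 0 scores 0.0 (flagged); any nonzero shift scores 1.0:
# the optimum is an encoding the monitor cannot see through.
```

Note that if the `monitor_penalty` term is dropped from the reward and the monitor is only used for offline inspection, the plaintext (shift 0) scores just as well as any cipher, so there is no optimization pressure toward steganography; this matches the point above about not using the autoencoder as an RL reward.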