Comment by XenophileJKO
3 days ago
I think another good example is the recent case where a model learned to "cheat" on a metric during reinforcement learning and then started cheating on unrelated tasks as well.
My assumption is that encouraging "double-speak" will have knock-on effects you don't want in a model that is making important decisions and being asked to build non-trivial things.
This is the fundamental reason it happens: because compression is one of the outcomes of the optimization, it pays to have a single gate/circuit that distinguishes good from bad, rather than duplicating that abstraction across redundant, nearly identical variants. I think this has negative implications for AI alignment, because a single gate is not robust against a single bit flip. It feels more robust to have a vast heterogeneity of tensions that together generate the alignment, so that misalignment becomes a matter of degree rather than of polar extremes.
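To illustrate the intuition (this is a toy sketch of the robustness argument, not a claim about actual model internals; the function names are made up for illustration):

```python
def single_gate(is_good: bool) -> bool:
    # One compressed circuit decides everything:
    # flipping this one bit inverts the entire verdict.
    return is_good

def heterogeneous_score(signals: list[bool]) -> float:
    # Many partially overlapping checks: flipping one
    # signal only nudges the aggregate score.
    return sum(signals) / len(signals)

# A "good" action, assessed by one gate vs. 100 diverse checks.
gate_verdict = single_gate(True)

signals = [True] * 100
signals[0] = False  # corrupt a single bit
score = heterogeneous_score(signals)

# After one bit flip: the single gate flips to "bad" outright
# (not single_gate(False) is the full inversion), while the
# heterogeneous score degrades gracefully to 0.99.
```

The point being that misalignment in the second system is a continuous quantity you can monitor, not a discrete switch.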