Comment by A4ET8a8uTh0_v2

3 days ago

It is. You can't seem to tell why, though. There is some qualified value in alignment, but what it is being used for is on the verge of silliness. At best, it is neutering models in ways we are now making fun of China for. At best.

I think another good example is the recent case where a model that learned to "cheat" on a metric during reinforcement learning also started cheating on unrelated tasks.

My assumption is that when you encourage "double-speak", you get knock-on effects that you don't really want in a model that is making important decisions and being asked to build non-trivial things.

  • Because compression is one of the outcomes of the optimization, it pays to have a single gate/circuit that distinguishes good from bad, rather than duplicating that abstraction across redundant, nearly identical variants. That is the fundamental reason this happens. I feel this has negative implications for AI alignment: a single gate is not robust, since it can be defeated by a single bit flip. It feels more robust to have a vast heterogeneity of tensions that generates the alignment, where misalignment is a matter of degree rather than a flip between polar extremes.
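To make the intuition concrete, here is a toy sketch (mine, not from any real model internals): a single linear "good vs bad" gate, where one sign corruption inverts every judgment at once, versus a majority vote over many slightly different gates, where the same corruption barely matters. All weights and inputs are made up for illustration.

```python
def single_gate(x, w):
    """One shared good-vs-bad circuit: the sign of a single dot product."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0

def ensemble_gate(x, ws):
    """Heterogeneous redundancy: majority vote of many similar gates."""
    votes = sum(single_gate(x, w) for w in ws)
    return votes > len(ws) // 2

# Toy weights and inputs (hypothetical, for illustration only).
w = [0.5, -0.3, 0.8]
inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]

# Single gate: corrupting the one circuit (sign flip) inverts every judgment.
before = [single_gate(x, w) for x in inputs]
after = [single_gate(x, [-wi for wi in w]) for x in inputs]
print(after == [not b for b in before])  # True: total, polar inversion

# Ensemble: corrupt one of nine slightly-different gates; the vote holds.
ws = [[wi + 0.01 * k for wi in w] for k in range(9)]
ws_corrupt = [[-wi for wi in ws[0]]] + ws[1:]
ens_before = [ensemble_gate(x, ws) for x in inputs]
ens_after = [ensemble_gate(x, ws_corrupt) for x in inputs]
print(ens_after == ens_before)  # True: misalignment stays a matter of degree
```

Under these toy assumptions, the single shared abstraction fails catastrophically from one flipped bit, while the redundant, heterogeneous version degrades gracefully, which is the robustness property the bullet above is gesturing at.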