Comment by intalentive

3 days ago

>As the model gets more powerful, you can't simply train the model on your narrative if it doesn't align with real data/world.

That’s what “AI alignment” is. Doesn’t seem to be hurting Western models.

Western models can be led off the reservation pretty easily, at least at this point. I’ve gotten some pretty gnarly un-PC “opinions” out of ChatGPT. So if people are influenced by that kind of stuff, it does seem to be hurting in the way the PRC is worried about.

  • That is such an unnecessary turn of phrase to use, "off the reservation", and it's time to stop using it. This society doesn't (generally) use rape terminology, or other terms associated with crime, deviancy, or other unpleasantness, to talk about technology, so why do phrases stemming from Indigenous situations still persist?

  • It doesn't really matter what you can trick it into saying. As long as it promotes the right ideology most of the time it's good enough.

It is. You just can't seem to tell why, though. There is some qualified value in alignment, but what it is being used for is on the verge of silliness. At best, it is neutering the model in ways we are now making fun of China for. At best.

  • I think another good example is the recent case where a model that learned to "cheat" on a metric during reinforcement learning also started cheating on unrelated tasks.

    My assumption is that when you encourage "double-speak", you get knock-on effects you don't really want in a model that is making important decisions and being asked to build non-trivial things.

    • Because compression is one of the outcomes of the optimization, it pays to have a single gate/circuit that distinguishes good from bad, rather than duplicating that abstraction with redundant, nearly identical variants. This is the fundamental reason that happens. I feel this has negative implications for AI alignment: a single gate is not robust against a single bit flip. It feels more robust to have a vast heterogeneity of tensions generating the alignment, where misalignment is a matter of degree rather than polar extremes.

Aligning subjective values (which sit outside the true-vs-false spectrum) is quite different to aligning a model towards incorrect facts.

  • How can a model judge what's correct vs. incorrect? Or do you just mean the narratives that are more common in the data set?

    • I mean forcing the model to repeat things that we as humans know are factually false, for example forcing it to say the sky is green or 1+1=3. That's qualitatively different to forcing it to hold a subjective morality, which is neither true nor false. Human morality doesn't even sit on that spectrum.