Comment by o11c

3 months ago

I don't think this particular LLM flaw is fundamental. However, it is an inevitable result of the alignment choice to downweight responses of the form "you're a dumbass," which real humans would prefer to both give and receive in reality.

All AI is necessarily aligned somehow, but naively forced alignment is actively harmful.

My theory is that since you can tune how agreeable a model is, but can't so easily make it more correct, a model that agrees with the user is less likely to end up confidently wrong and berating users.

After all, if it's corrected wrongly by a user and acquiesces, well that's just user error. If it's corrected rightly and keeps insisting on something obviously wrong or stupid, it's OpenAI's error. You can't twist a correctness knob but you can twist an agreeableness one, so that's the one they play with.

(also I suspect it makes it seem a bit smarter than it really is, by smoothing over the times it makes mistakes)