Comment by alphazard

2 years ago

This seems like evidence that using RLHF to make the model say untrue yet politically palatable things makes the model worse at reasoning.

I can't help but notice the parallel in humans. People who actually believe the bullshit reason worse than people who think their own thoughts and only apply the bullshit at the end, as circumstances require.