Comment by getnormality

1 month ago

> something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance

I've been seeing more people speculate like this, and I don't understand why. What evidence do we have that RLHF degrades performance on a key metric like reasoning? And why would model developers tolerate that?

Can someone point to an example of an AI researcher saying "oops, RLHF forcefully degrades reasoning capabilities, oh well, nothing we can do"?

It strikes me as conspiracist reasoning, like "there's a car that runs on water but they won't sell it because it would destroy oil profits".

The most obvious mechanism would simply be excessive agreeableness. Users rate responses more highly when they affirm the user's thinking, and a general tendency to affirm would presumably make the model more inclined to affirm its own mistakes within a reasoning chain.

There was some early research on this that was widely shared and shaped the folklore perception around it, such as the graph from the GPT-4 whitepaper (https://static.wixstatic.com/media/be436c_84a7dceb0d834a37b3...) showing that RLHF destroyed the base model's calibration (its ability to accurately estimate the likelihood that its guesses are correct). Of course, the field may have moved on in the 2+ years since then.
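For reference, calibration in that graph is the kind of thing usually quantified with expected calibration error (ECE): bucket the model's stated confidences, then compare each bucket's average confidence against its actual accuracy. A minimal sketch of the idea (the binning scheme and the toy numbers are illustrative, not taken from the whitepaper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then take the
    sample-weighted average gap between each bin's mean confidence
    and its empirical accuracy (standard ECE)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy numbers, purely illustrative: a model that says "90% sure"
# but is right only ~60% of the time shows a large gap in that bin.
conf = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.3]
hits = [1,   0,   1,   0,   1,   1,   0,   0]
print(expected_calibration_error(conf, hits))
```

A well-calibrated model scores near 0; what the GPT-4 graph showed was the post-RLHF model's confidence curve pulling away from the diagonal that the base model roughly tracked.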