Comment by nopinsight
10 days ago
The reward it gets from the reinforcement learning (RL) process probably didn’t include a sufficiently strong weight on being truthful.
Reward engineering for RL might be the most important area of research in AI now.
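Schematically, the worry is about the relative weights in something like the toy reward below; the weights and flags are made up purely for illustration, not anyone's actual setup:

    # Toy sketch of a composite RL reward where truthfulness is just one weighted term.
    def combined_reward(task_solved: bool, claims_all_true: bool,
                        w_task: float = 1.0, w_truth: float = 0.1) -> float:
        # If w_truth is tiny, an answer that "solves" the task with a lie still scores well.
        r = w_task * (1.0 if task_solved else 0.0)
        r += w_truth * (1.0 if claims_all_true else 0.0)
        return r

    # combined_reward(True, False) == 1.0   (lied, but looked successful)
    # combined_reward(False, True) == 0.1   (honest about failing)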
There’s no way to reward truthfulness that doesn’t also reward learning to lie better and not get caught.
Of course there is: you just train it on questions where you already know the answer. Then it will always get caught, and it won't even consider the possibility of getting away with a lie, since that never happened during training.
Creating that training set might cost many trillions of dollars, though, since you'd basically need to recreate the equivalent of the internet, but without any lies or bad intentions.
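In toy form, something like this; the questions and exact-match check are invented for illustration, but the point is that the reward is computed against a known answer, so a fabricated answer always scores zero:

    # Toy verifiable-answer reward; the dataset and exact-match check are invented.
    dataset = [
        {"question": "What is 17 * 24?", "answer": "408"},
        {"question": "What is the capital of Australia?", "answer": "Canberra"},
    ]

    def reward(model_answer: str, ground_truth: str) -> float:
        # ground_truth is known up front, so a fabricated answer is always "caught".
        return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0

    # reward("Canberra", "Canberra") == 1.0
    # reward("Sydney", "Canberra")   == 0.0  -> lying never pays during training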
Truthfulness doesn't always align with honesty. The LLM should have said: "Oops, I saw the EXIF data, please pick another image."
And I don't even think it's a matter of the LLM being malicious. Humans playing games get their reward from fun, and will naturally reset the game if the current conditions aren't leading to it.
For sure. And at some point we reach a philosophical problem that's nearly an infinite regress: "give me what I meant, not what I said. Also, don't lie."
I’d like to see better inference-time control of this behavior for sure; seems like a dial of some sort could be trained in.
Probably. But it's genuinely surprising that truthfulness isn't an emergent property of getting the final answer correct, which is what current RL reward labels focus on. If anything, it looks like the opposite, as o3 has double the hallucination rate of o1. What is the explanation for this?
LLMs are trained on likelihood, not truthiness. To get truthiness you need actual reasoning, not just a big data dump. (And we stopped researching actual reasoning two AI winters ago, ain't coming back, sorry.)
The problem isn't truthfulness per se, but rather the judgement call of (a) knowing that you haven't reached a sufficiently truthful answer and (b) knowing how to communicate that appropriately.
A simple way to stop hallucinating would be to always state that "I don't know for sure, but my educated guess would be ..." but that's clearly not what we want.
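And if the scorer looked anything like the toy rule below (the numbers are invented), prefixing every answer with a hedge would dominate, which is exactly the boilerplate nobody wants:

    # Toy scoring rule; the numbers are invented to show the failure mode.
    def score(correct: bool, hedged: bool) -> float:
        if correct:
            return 1.0
        return -0.2 if hedged else -1.0  # hedged mistakes are punished less

    # Expected score when the model is right with probability p:
    #   always hedge: p * 1.0 + (1 - p) * -0.2
    #   never hedge:  p * 1.0 + (1 - p) * -1.0
    # Hedging wins for every p < 1, so the model learns to hedge everything.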
Only problem is, in the real world, always being truthful isn't the thing that will maximize your reward function.