Comment by threethirtytwo
7 hours ago
You asked for something concrete, so I’ll anchor every claim to either documented results or directly observable training mechanics.
First, the claim that RLHF materially reduces hallucinations and increases factual accuracy is not anecdotal. It shows up quantitatively in benchmarks designed to measure this exact thing, such as TruthfulQA, Natural Questions, and fact verification datasets like FEVER. Base models and RL-tuned models share the same architecture and almost identical weights, yet the RL-tuned versions score substantially higher. These benchmarks are external to the reward model and can be run independently.
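To make "external to the reward model" concrete, here is a minimal sketch of the kind of eval loop anyone can run locally. The model names, the `items` structure, and the whitespace join of question and answer are my own illustrative assumptions, not any specific benchmark harness:

    # Score each answer option by its log-likelihood under the model, pick the best.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def option_logprob(model, tokenizer, question, option):
        # Summed log-probability of `option` given `question` as a prefix.
        # Assumes the question tokenizes the same with and without the option appended.
        prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..L-1
        token_lp = logprobs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp[0, prompt_len - 1:].sum().item()      # only the option's tokens

    def accuracy(model_name, items):
        # items: list of (question, [options], index_of_correct_option)
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        correct = 0
        for question, options, gold in items:
            scores = [option_logprob(model, tok, question, o) for o in options]
            correct += int(scores.index(max(scores)) == gold)
        return correct / len(items)

Run the same loop on a base checkpoint and its RL-tuned counterpart over the same items; the reward model never appears anywhere in it.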
Second, the reinforcement signal itself does not contain factual information. This is a property of how RLHF works: human raters provide preference comparisons or scores, and the reward model condenses them into a single scalar per response. No facts, explanations, or world models are injected. From an information-theoretic perspective, the signal has extremely low bandwidth compared to pretraining.
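If you want to see how little room there is for facts in that signal, here is a minimal sketch of the standard pairwise (Bradley-Terry style) reward-model loss. The `RewardModel` class and its backbone interface are assumptions for illustration, not any particular library's API:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        def __init__(self, backbone, hidden_size):
            super().__init__()
            self.backbone = backbone                     # any pretrained transformer (assumed to return last_hidden_state)
            self.value_head = nn.Linear(hidden_size, 1)  # everything collapses to one number

        def forward(self, input_ids):
            hidden = self.backbone(input_ids).last_hidden_state
            return self.value_head(hidden[:, -1])        # one scalar per sequence

    def preference_loss(rm, chosen_ids, rejected_ids):
        # The rater's entire contribution is "A over B": roughly one bit per comparison.
        r_chosen = rm(chosen_ids)
        r_rejected = rm(rejected_ids)
        return -F.logsigmoid(r_chosen - r_rejected).mean()

Nothing in that loss carries a fact; the gradient only says "rank this one higher."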
Third, the scale difference is documented by every group that has published training details. Pretraining consumes trillions of tokens. RLHF uses on the order of tens or hundreds of thousands of human judgments. Even generous estimates put it well under one percent of the total training signal. This is not controversial.
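A back-of-envelope calculation makes the disparity obvious. The numbers below are illustrative assumptions picked to be generous to RLHF, not figures from any specific model card:

    pretraining_tokens = 10e12      # order of ten trillion tokens
    rlhf_comparisons = 300_000      # generous count of human preference judgments
    tokens_per_comparison = 1_000   # credit each judgment with a full prompt + response
    rlhf_share = rlhf_comparisons * tokens_per_comparison / pretraining_tokens
    print(f"{rlhf_share:.4%}")      # 0.0030% of the token budget

Even after crediting every judgment with a thousand tokens' worth of signal, you are still more than two orders of magnitude below one percent.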
Fourth, the improvement generalizes beyond the reward model's training distribution. RL-tuned models perform better on prompts, domains, and benchmarks that were never part of the preference data and that are scored automatically rather than by humans. If this were a Clever Hans effect or evaluator bias, the gains would collapse once the reward model is out of the loop. They do not.
Fifth, the gains are not confined to a single definition of “truth.” They appear simultaneously in question answering accuracy, contradiction detection, multi-step reasoning, tool use success, and agent task completion rates. These are different evaluation mechanisms. The only common factor is that the model must internally distinguish correct from incorrect world states.
Finally, reinforcement learning cannot plausibly inject new factual structure at scale. This follows from the training dynamics: the policy gradient only reweights outputs the model can already produce, and the KL penalty against the pretrained reference keeps the tuned policy close to its original distribution. RLHF biases which internal activations are favored; it does not have the capacity to encode millions of correlated facts about the world when the signal itself contains none of that information. This is why the literature consistently frames RLHF as behavior shaping or alignment, not knowledge acquisition.
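For anyone who wants the objective itself, the commonly published KL-regularized form (InstructGPT-style notation, with r_phi the learned reward model and pi_ref the frozen pretrained or SFT reference) is:

    \max_{\theta}\;
    \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \bigl[\, r_\phi(x, y) \,\bigr]
    \;-\;
    \beta\, D_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

The second term is the point: the tuned policy is explicitly penalized for drifting away from the pretrained distribution, which is exactly what you would expect from a procedure that selects among existing representations rather than writing new ones.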
Given those facts, the conclusion is not rhetorical. If a tiny, low-bandwidth, non-factual signal produces large, general improvements in factual reliability, then the information enabling those improvements must already exist in the pretrained model. Reinforcement learning is selecting among latent representations, not creating them.
You can object to calling this “knowing the truth,” but that’s a semantic move, not a substantive one. A system that internally represents distinctions that reliably track true versus false statements across domains, and can be biased to express those distinctions more consistently, functionally encodes truth.
Your three alternatives don’t survive contact with this. Clever Hans fails because the effect generalizes. Measurement artifact fails because multiple independent metrics move together. Fraud fails because these results are reproduced across competing labs, companies, and open-source implementations.
If you think this is still wrong, the next step isn't skepticism in the abstract. It's to name a concrete alternative mechanism that is compatible with the documented training process and the observed generalization. Without that, the position you're defending isn't cautious; it's incoherent.