Comment by Terr_

10 days ago

> One possible explanation here: as these get smarter, they lie more to satisfy requests.

I feel there's some kind of unfounded anthropomorphization in there.

In contrast, consider the framing:

1. A system with more resources is able to return more options that continue the story.

2. The probability of any given option being false (when evaluated against the real world) is greater than the probability of it being true, and there are also more possible options that continue the story than ones which terminate it.

3. Therefore we get more "lies" because of probability and scale, rather than from human-like characteristics (a rough toy sketch of this follows the list).
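
A minimal sketch of that framing, with numbers invented purely for illustration: if false-but-plausible continuations vastly outnumber true ones, then a system that surfaces more candidates returns more falsehoods in absolute terms, with no intent involved.

```
import random

# Invented toy numbers: for a given prompt, false-but-plausible
# continuations vastly outnumber true ones, and a "bigger" system
# simply surfaces more candidates from the same pool.
TRUE_OPTIONS = 10
FALSE_OPTIONS = 1000
pool = ["true"] * TRUE_OPTIONS + ["false"] * FALSE_OPTIONS

def sample_continuations(k, seed=0):
    """Draw k candidate continuations uniformly from the pool."""
    rng = random.Random(seed)
    picks = [rng.choice(pool) for _ in range(k)]
    return picks.count("false"), picks.count("true")

for k in (1, 10, 100):
    n_false, n_true = sample_continuations(k)
    print(f"{k:>3} options returned -> {n_false} false, {n_true} true")
```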

That is also similar, in a sense, to a typical human behavior: "rounding" a "logical" argument, then building the next one on top of it, rounding again at each step (or at least many of them) in succession, and basically ending up at an arbitrary (or intended) conclusion.
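
To put a rough number on that compounding (the per-step figure here is assumed, not measured): if each step of a chain only preserves correctness with some probability, a long chain is almost guaranteed to drift.

```
step_accuracy = 0.9  # assumed chance that a single step preserves correctness
for n_steps in (1, 5, 10, 20, 40):
    p_sound = step_accuracy ** n_steps
    print(f"{n_steps:>2} steps: P(whole chain still sound) = {p_sound:.2f}")
```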

This is hard to correct with global training, because you would need to correct each step, even the most basic ones. It's like trying to convince someone that their conclusion is wrong when what you actually have to do is show the errors in the steps that led there.

For LLMs it feels even trickier, since complex paths seem to be encoded dynamically out of simple steps, rather than as some clearer/deeper path that could be activated and corrected. Correcting one complex "truth" seems much more straightforward than targeting the basic assumptions thoroughly enough that they won't build up into something strange again.

I wonder what effective ways exist to correct these reasoning models. Like activating the full context and then retraining the faulty steps, or even "overcorrecting" the most basic ones?

I see a sort of parallel in how fuzzy matching in search has become so ubiquitous, because returning 0 results means you don't get any clicks. That sort of reward function means the fuzzy padding should get worse the fewer true results there are.
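
A minimal sketch of that incentive, with a hypothetical search backend that refuses to return an empty page: the fewer true hits there are, the more of the page gets filled with loose matches.

```
def search(exact_hits, fuzzy_pool, page_size=10):
    # Never return an empty page: pad scarce exact hits with fuzzy matches.
    results = list(exact_hits[:page_size])
    results += fuzzy_pool[:page_size - len(results)]
    return results

print(search(["the one real match"],
             [f"loosely related result #{i}" for i in range(20)]))
```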