Comment by nico
2 months ago
> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>")
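For anyone curious what that forcing loop looks like in practice, here's a minimal sketch. `model_step` and `append_text` are hypothetical helpers standing in for a real decoding loop; this is an illustration of the idea, not the actual s1 code.

```python
# Minimal sketch of "budget forcing" as described in the quote above.
# `model_step(text)` returns the next token; `append_text(text, s)` appends it.
# Both are hypothetical placeholders, not the s1 implementation.

MIN_THINK_TOKENS = 512    # below this budget, suppress "</think>" and keep thinking
MAX_THINK_TOKENS = 4096   # above this budget, cut thinking off

def generate_with_budget(prompt, model_step, append_text):
    text = prompt + "<think>"
    n_think = 0
    while True:
        token = model_step(text)
        if token == "</think>" and n_think < MIN_THINK_TOKENS:
            # Extend: replace the attempted stop with "Wait" to force more reasoning.
            text = append_text(text, "Wait")
            n_think += 1
            continue
        if n_think >= MAX_THINK_TOKENS and token != "</think>":
            # Trim: abruptly insert "</think>" to end the thinking phase.
            text = append_text(text, "</think>")
            break
        text = append_text(text, token)
        n_think += 1
        if token == "</think>":
            break
    return text  # the caller then decodes the final answer after </think>
```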
I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, i.e. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution
It also feels analogous to navigating a tree: the more time you have to explore the nodes, the more of the space you'll have covered, and hence the higher the chance of finding a better solution
At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it
What’s missing in that analogy is that humans tend to have a good hunch about when they have to think more and when they are “done”. LLMs seem to be missing a mechanism for that kind of awareness.
LLMs actually do have such a hunch; they just don't utilize it. You can literally ask them "Would you do better if you started over?" and start over if the answer is yes. This works.
https://arxiv.org/abs/2410.02725
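In loop form, that's something like the sketch below. `ask(prompt)` is a hypothetical single-turn chat call; the restart question is the one from the comment, not a prompt from the linked paper.

```python
# Sketch of the "ask whether a restart would help" loop described above.
# `ask(prompt)` is an assumed single-turn LLM call, not a specific API.

def solve_with_self_check(task, ask, max_restarts=3):
    answer = ask(f"Solve this task:\n{task}")
    for _ in range(max_restarts):
        verdict = ask(
            f"Task:\n{task}\n\nYour previous answer:\n{answer}\n\n"
            "Would you do better if you started over? Answer yes or no."
        )
        if not verdict.strip().lower().startswith("yes"):
            break  # the model thinks its current answer is as good as it will get
        answer = ask(f"Solve this task from scratch:\n{task}")
    return answer
```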
Great observation. Maybe an additional "routing model" could be trained to predict when it's better to keep thinking vs. just using the current result.
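Purely as an illustration of the shape such a router could take (nothing like this exists in s1, and `score_trace` is an assumed model): a small binary classifier over the partial reasoning trace that outputs "keep thinking" vs. "answer now".

```python
# Illustrative shape of a hypothetical "routing model": a classifier over the
# partial reasoning trace that decides whether more thinking is worthwhile.

from dataclasses import dataclass

@dataclass
class RoutingDecision:
    keep_thinking: bool
    confidence: float

def route(partial_trace: str, score_trace) -> RoutingDecision:
    # `score_trace` is an assumed model returning P(more thinking improves the answer).
    p_improve = score_trace(partial_trace)
    return RoutingDecision(keep_thinking=p_improve > 0.5, confidence=p_improve)
```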