Comment by gorgoiler

2 months ago

This feels just like telling a constraint satisfaction engine to backtrack and find a better route through the graph. We saw this 25 years ago with engines like PROVERB doing directed backtracking, and with adversarial planning when automating competitive games.

Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control this repeated search for the optimal path be the inference engine itself?

Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?

What would a superior control approach be? It's not clear to me how to get an LLM to be an LLM if you're not doing stochastic next-token prediction. Given that, the model itself is going to know best how to traverse its own concept space. The R1 chain-of-thought training encourages and develops exactly that capability. Still, you want that chain of thought to terminate and not navel-gaze endlessly.

So how do you externally prod it to think more when it does terminate? Replacing thought termination with a linguistic signifier of continued reasoning plus novel realization seems like a charmingly simple, principled, and general way to keep traversing concept space.
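For concreteness, here's roughly what that looks like in practice. This is only a minimal sketch with Hugging Face transformers, not the paper's exact setup: the model name, the `</think>` delimiter, and the extension count are placeholders I'm assuming for illustration.

```python
# Sketch of "budget forcing": when the model tries to close its chain of
# thought, strip the terminator and append "Wait," so it keeps reasoning.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder; any causal LM with a reasoning delimiter
END_THINK = "</think>"                  # assumed end-of-thinking delimiter

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_wait(prompt: str, min_extensions: int = 2, max_new_tokens: int = 512) -> str:
    text = prompt
    extensions = 0
    while True:
        ids = tok(text, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        # If the model closed its reasoning too early, reopen it with "Wait,"
        # and let it continue from that prefix.
        if END_THINK in text and extensions < min_extensions:
            text = text.split(END_THINK)[0] + " Wait,"
            extensions += 1
            continue
        return text
```

The whole intervention lives at the prompt/decode boundary, which is exactly the "fixing dad's laptop over the phone" quality being discussed: no change to the inference engine at all.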

This is the difference between science and engineering. What they have done is engineering. If the result is 90% of the way there with barely any effort, it's better to move on to something else that may be low-hanging fruit than to spend time chasing that last 10%.

Totally agreed this is not the solution we are looking for; in fact, it's the only solution we have in our hands right now. It's a good step forward.