Comment by zeroxfe
1 month ago
> “Ok, so…” is wasted tokens
This is not the case -- it's actually the opposite. The more of these tokens it generates, the more thinking time it gets (much like humans going "ummm" while they work something out). Loosely speaking, every generated token is another full pass through the model, updating and refining the KV cache state and further extending the context.
If you look at how post-training works for logical questions, the preferred answers are front-loaded with "thinking tokens" -- they consistently perform better. So, for the question "what is 1 + 1?", models are post-trained to prefer "1 + 1 is 2" over just "2".
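A minimal sketch of what that means mechanically, with `model` and `dummy_model` standing in for any causal LM that returns next-token logits plus an updated cache (all names here are illustrative, not from the thread):

```python
import random

def generate(model, prompt_ids, max_new_tokens):
    """Greedy decode loop: every new token, filler or not, costs one more
    forward pass and then sits in the context that later tokens (including
    the eventual answer) can attend to."""
    ids = list(prompt_ids)
    cache = None
    for _ in range(max_new_tokens):
        logits, cache = model(ids, cache)   # one full pass through the weights
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)                 # "Ok, so..." tokens land here too
    return ids

def dummy_model(ids, cache):
    # Stand-in for a real causal LM: fake logits over a 100-token vocabulary.
    random.seed(len(ids))
    return [random.random() for _ in range(100)], cache

print(generate(dummy_model, prompt_ids=[1, 2, 3], max_new_tokens=5))
```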
> the more thinking time it gets
That's not how LLMs work. These filler word tokens eat petaflops of compute and don't buy time for it to think.
Unless they're doing some crazy speculative sampling pipeline where the smaller LLM is trained to generate filler words while instructing the pipeline to temporarily ignore the speculative predictions and generate full predictions from the larger LLM. That would be insane.
The filler tokens actually do make them think more. Even just allowing the models to output "." until they are confident enough to output something increases their performance. Of course, training the model to do this (use pause tokens) on purpose works too: https://arxiv.org/pdf/2310.02226
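For reference, the mechanism in that paper (as I read it) looks roughly like the sketch below: learned pause tokens are appended after the question, each one buys an extra forward pass whose output is ignored, and the answer is only read off afterwards. `next_token`, the token ids, and `NUM_PAUSES` are placeholders, not the paper's actual code.

```python
EOS_ID = 0        # hypothetical end-of-answer token id
PAUSE_ID = 999    # hypothetical learned <pause> token id
NUM_PAUSES = 10

def answer_with_pauses(next_token, question_ids):
    """Append <pause> tokens after the question, then decode the answer.
    Outputs at the pause positions are never read; they only add compute
    and context before the first real answer token."""
    ids = list(question_ids) + [PAUSE_ID] * NUM_PAUSES
    answer = []
    while True:
        tok = next_token(ids)      # one forward pass + sampling
        ids.append(tok)
        if tok == EOS_ID:
            return answer
        answer.append(tok)
```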
OK, that effect is super interesting. Though if you assume all the computational pathways happen in parallel on a GPU, it doesn't necessarily increase the time the model spends thinking about the question; it just conditions the model to generate a better output when it actually decides to emit a non-pause answer. If you condition them to generate pauses, they aren't really "thinking" about the problem while they generate pauses; they're just learning to generate pauses and to do the actual thinking only at the last step, when the non-pause output is generated, utilizing the additional pathways.
If however there were a way to keep passing hidden states to future autoregressive steps and not just the final tokens from the previous step, that might give the model true "thinking" time.
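One way to picture the distinction being drawn here, with `model` and `sample` as made-up placeholders: standard decoding only carries the sampled token id (plus the KV cache) between steps, while the hypothetical variant would also feed the previous step's final hidden state back in.

```python
def standard_step(model, ids):
    logits, hidden = model(ids)            # final hidden state is thrown away
    return sample(logits)                  # only the discrete token id survives

def recurrent_step(model, ids, prev_hidden):
    logits, hidden = model(ids, extra=prev_hidden)  # hypothetical extra input
    return sample(logits), hidden                   # hidden state carried forward
```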
Each token requires the same amount of compute. To a very crude approximation, model performance scales with total compute applied to the task. It’s not absurd that producing more tokens before an answer improves performance, in a way that’s akin to giving the model more time (compute) to think.
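A rough back-of-the-envelope for "each token costs the same compute", using the common 2-FLOPs-per-parameter estimate for a dense forward pass (the model size is just an assumed example):

```python
params = 70e9                    # assumed model size, purely illustrative
flops_per_token = 2 * params     # ~1.4e11 FLOPs for one forward pass
extra_tokens = 100               # an "Ok, so..." style preamble
print(f"{extra_tokens * flops_per_token:.1e} extra FLOPs")   # ~1.4e13
```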
It’s more like conditioning the posterior of a response on “Ok, so…” lets the model enter a better latent space for answering logically vs just spitting out a random token.
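Concretely, "conditioning on 'Ok, so…'" just means forcing those tokens into the context before sampling the rest, so the answer is drawn from p(answer | prompt, "Ok, so…"). A tiny sketch, with `generate` and `tokenize` as placeholders:

```python
def answer_with_forced_prefix(generate, tokenize, prompt):
    forced = tokenize(prompt) + tokenize("Ok, so")   # teacher-forced prefix
    return generate(forced)                          # sampling continues from here
```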
I don’t think you have an accurate understanding of how LLMs work.
https://arxiv.org/abs/2501.19393
These tokens DO extend the thinking time. We are talking about causal autoregressive language models, and so these tokens can be used to guide the generation.
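The linked paper (s1, "simple test-time scaling") makes this concrete with what it calls budget forcing: if the model tries to close its reasoning block too early, the decoder suppresses the delimiter and appends "Wait" so it keeps generating thinking tokens. A minimal sketch of that idea, with the delimiter and `generate_one` as placeholders rather than the paper's code:

```python
END_THINK = "</think>"     # assumed end-of-reasoning delimiter

def budget_forced_decode(generate_one, prompt, min_think_tokens):
    """Keep appending reasoning tokens; if the model tries to end its
    thinking block before the budget is spent, append "Wait" instead."""
    text, n_think = prompt, 0
    while True:
        piece = generate_one(text)               # one decoded token/chunk
        if piece == END_THINK:
            if n_think < min_think_tokens:
                text += " Wait"                  # force more reasoning
                continue
            return text + piece
        text += piece
        n_think += 1
```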