Comment by wlib
1 month ago
The filler tokens actually do make them think more. Even just allowing the models to output "." until they are confident enough to output something increases their performance. Of course, training the model to do this (use pause tokens) on purpose works too: https://arxiv.org/pdf/2310.02226
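To make the mechanism concrete, here is a toy sketch of pause-token inference. Nothing here is a real LM: `MockModel`, the `PAUSE` token, and the confidence rule are all invented for illustration. The one real idea it mirrors (from the linked paper) is that each appended filler token lengthens the context, so the next forward pass has more positions, i.e. more parallel computation, before the model commits to an answer.

```python
PAUSE = "."

class MockModel:
    """Stand-in for an autoregressive LM. Pretends that a longer context
    (more pause positions = more computational pathways per pass) yields
    higher confidence in the answer."""
    def __init__(self, confidence_threshold=0.99):
        self.threshold = confidence_threshold
        self.forward_passes = 0

    def step(self, context):
        self.forward_passes += 1
        # Fake confidence that grows with context length.
        confidence = 1 - 0.5 ** len(context)
        if confidence >= self.threshold:
            return "42", confidence
        return PAUSE, confidence

def generate(model, prompt, max_pauses=10):
    context = list(prompt)
    for _ in range(max_pauses):
        token, _ = model.step(context)
        if token != PAUSE:
            return token, context
        context.append(PAUSE)  # filler token extends the next pass's context
    # Pause budget spent: forced to answer.
    return model.step(context)[0], context

model = MockModel()
answer, ctx = generate(model, ["What", "is", "6*7", "?"])
print(answer, ctx.count(PAUSE), model.forward_passes)  # → 42 3 4
```

The point of the toy: the model "buys" three extra forward passes by emitting pauses before its non-pause answer, without any hidden state being carried between steps.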
OK, that effect is super interesting. But if you assume all the computational pathways run in parallel on a GPU, it doesn't necessarily increase the time the model spends thinking about the question; it just conditions the model to generate a better output when it finally decides to emit a non-pause answer. If you condition models to generate pauses, they aren't really "thinking" about the problem while they generate the pauses: they just learn to emit pauses and do the actual thinking only at the last step, when the non-pause output is generated, using the additional pathways.
If however there were a way to keep passing hidden states to future autoregressive steps and not just the final tokens from the previous step, that might give the model true "thinking" time.
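A minimal numerical sketch of that idea: instead of feeding only the sampled token back into the next step, feed the previous step's hidden state forward, so each "thinking" step carries full continuous information rather than a single discrete token. The shapes, the `tanh` update, and the tiny 3-token vocabulary are all illustrative assumptions, not any real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension (arbitrary for the sketch)
W_think = rng.standard_normal((d, d)) / np.sqrt(d)  # recurrent "thinking" map
W_out = rng.standard_normal((d, 3))                 # projects to a 3-token vocab

def think_then_answer(h0, n_thought_steps):
    """Run n latent 'thinking' steps, passing the hidden state itself
    forward, then decode a token only once at the end."""
    h = h0
    for _ in range(n_thought_steps):
        h = np.tanh(W_think @ h)   # hidden state flows to the next step intact
    logits = W_out.T @ h           # only the final step emits a token
    return int(np.argmax(logits))

h0 = rng.standard_normal(d)
print(think_then_answer(h0, 0), think_then_answer(h0, 5))
```

Contrast with the pause-token case: there, each step only sees the discrete tokens emitted so far; here, the intermediate computation survives between steps, which is closer to genuine serial "thinking" time.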
> if you assume all the computational pathways happen in parallel on a GPU, that doesn't necessarily increase the time the model spends thinking about the question
The layout of the NN is actually quite complex: a large amount of information is computed besides the tokens themselves and the weights (think "latent vectors").
I recommend the 3Blue1Brown YouTube series on the topic.