Comment by krackers

4 days ago

Except generating more tokens also effectively extends the computational power beyond the depth of the circuit, which is why chain of thought works in the first place. Even sampling only dummy tokens that don't convey anything still provides more computational power.
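To make the compute argument concrete, here's a rough sketch (my own illustrative cost model and numbers, nothing from an actual implementation): each generated token triggers another full forward pass, so even filler tokens buy the model extra sequential steps whose activations later tokens can attend to.

    # Illustrative only: a crude cost model for autoregressive decoding.
    # A transformer with L layers does a fixed amount of sequential work
    # per forward pass; decoding runs one pass per generated token, so
    # emitting T extra tokens (even meaningless filler) gives roughly T
    # more passes whose intermediate states later positions can attend to.

    def forward_pass_cost(num_layers: int, context_len: int) -> int:
        # Rough proxy: attention over the current context at each layer.
        return num_layers * context_len

    def total_decode_cost(num_layers: int, prompt_len: int, generated: int) -> int:
        cost = 0
        for t in range(generated):
            cost += forward_pass_cost(num_layers, prompt_len + t)
        return cost

    # Hypothetical sizes: answering immediately vs. after 50 filler tokens.
    print(total_decode_cost(num_layers=32, prompt_len=100, generated=1))   # one pass
    print(total_decode_cost(num_layers=32, prompt_len=100, generated=51))  # ~51 passes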

I mean, generating more tokens means you use more computing power, and there's some evidence that not all of these filler words go to waste (especially since they aren't really words but vectors that can carry latent meaning), as models tend to get smarter when allowed to generate a lot of hemming and hawing.

It's been shown that this incidental computation does help CoT models, but that's not how they're supposed to work: they're supposed to generate logical observations and use those observations to work further toward the goal (and that is primarily what they do).

Considering that filler tokens occupy context space and are less useful than meaningful tokens, for a model that tries to maximize useful results per unit of compute you'd want a terse context window without any fluff.