Comment by yawnxyz

1 month ago

> 好的,用户发来消息:“hello do you speak english” [“Okay, the user sent the message: ‘hello do you speak english’”] (Hunyuan-T1 thinking response)

It's kind of wild that even a Chinese model emits "好的" as its first tokens -- which basically means "Ok, so..." -- just like R1 and the other models do. Is this RL'ed in, or just somehow a natural effect of the training?

If anything, I feel like “Ok, so…” is wasted tokens, so you’d think RL that incentivizes more concise thought chains would eliminate it. Maybe it’s actually useful in compelling the subsequent text to be more helpful or insightful.

  • There was a paper[1] from last year where the authors discovered that getting the model to output anything during times of uncertainty improved the generations overall. If all of the post-training alignment reasoning starts with the same tokens, then I could see how it would condition the model to continue the reasoning phase.

    1: https://arxiv.org/abs/2404.15758

    • This is probably because the thinking tokens get the opportunity to store higher-level/summarized contextual reasoning (lookup-table-style associations) in those tokens' KV cache entries. So an "Ok so" at position X may carry summarization vibes that are distinct from one at position Y.
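
      A quick way to check that the cache entries for a repeated token really do differ by position and context (a minimal sketch, assuming Hugging Face transformers, with gpt2 as a stand-in model):

        import torch
        from transformers import AutoModel, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModel.from_pretrained("gpt2").eval()

        text = "Ok so the sum is even. Ok so it divides by two."
        ids = tok(text, return_tensors="pt").input_ids
        so_id = tok.encode(" so")[0]
        pos = (ids[0] == so_id).nonzero().flatten().tolist()  # the two places " so" occurs
        a, b = pos[0], pos[-1]

        with torch.no_grad():
            out = model(ids, use_cache=True)

        # layer-0 keys: (batch, heads, seq_len, head_dim); same token, different positions
        k = out.past_key_values[0][0]
        print(torch.allclose(k[0, :, a], k[0, :, b]))  # False: the cached entries encode different context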

  • > “Ok, so…” is wasted tokens

    This is not the case -- it's actually the opposite. The more of these tokens it generates, the more thinking time it gets (very much like humans going "ummm" all the time). Loosely speaking, every token generated is another iteration through the model, updating (and refining) the KV cache state and further extending the context.

    If you look at how post-training works for logical questions, the preferred answers are front-loaded with "thinking tokens" -- they consistently perform better. So, if the question is "what is 1 + 1?", the model is post-trained to prefer "1 + 1 is 2" as opposed to just "2".
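
    A minimal sketch of what "every token is another iteration" means mechanically (assuming Hugging Face transformers, with gpt2 as a stand-in model): each generated token, filler or not, costs one more forward pass and appends its keys/values to the KV cache.

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

      ids = tok("Ok, so", return_tensors="pt").input_ids
      past = None
      with torch.no_grad():
          for step in range(5):
              out = model(ids if past is None else ids[:, -1:],
                          past_key_values=past, use_cache=True)
              past = out.past_key_values
              next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
              ids = torch.cat([ids, next_id], dim=-1)
              # the cache grows by one position per generated token
              print(step, past[0][0].shape)  # (batch, heads, seq_len_so_far, head_dim)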

    • > the more thinking time it gets

      That's not how LLMs work. These filler-word tokens eat petaflops of compute and don't buy it extra time to think.

      Unless they're doing some crazy speculative-sampling pipeline where the smaller LLM is trained to generate the filler words while the pipeline temporarily ignores those speculative predictions and gets full predictions from the larger LLM. That would be insane.

      7 replies →

  • Ok, so I'm thinking here that.. hmm... maybe.. just maybe... there is something that, kind of, steers the rest of the thought process into a, you know.. more open process? What do you think? What do I think?

    As opposed to the more literary, authoritative prose from textbooks and papers, where the model output has to commit to a chain of thought from the get-go. Some interesting, relatively new results show that time spent on output tokens corresponds more or less linearly to better inference quality, so I guess this is just a way to achieve that.

    The tokens are inserted artificially in some inference setups: when the model wants to stop, you swap the end token for "hmmmm" and it will happily keep going.
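
    A minimal sketch of that swap (assuming Hugging Face transformers, with gpt2 and its plain EOS token standing in for a reasoning model's end-of-thinking token; the filler string and swap count are made up for illustration):

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

      filler_id = tok.encode(" hmm")[0]   # first token of the filler we substitute
      ids = tok("Ok, so the answer is", return_tensors="pt").input_ids
      swaps_left = 2                      # how many times we refuse to let it stop

      with torch.no_grad():
          for _ in range(40):
              next_id = model(ids).logits[:, -1].argmax(dim=-1, keepdim=True)
              if next_id.item() == tok.eos_token_id:
                  if swaps_left == 0:
                      break
                  next_id[...] = filler_id   # overwrite the stop token and keep decoding
                  swaps_left -= 1
              ids = torch.cat([ids, next_id], dim=-1)

      print(tok.decode(ids[0]))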

  • > RL that incentivizes more concise thought chains

    This seems backwards: token servers charge per token, so they'd be incentivized to add more of them, no?

Surprisingly, Gemini (Thinking) doesn't do that -- it thinks very formally, as if it's already formed its response.