
Comment by wat10000

13 days ago

That would be strange. There's no hidden memory or data channel; the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helped somehow (LLMs are weird), but it would be odd.

This isn't quite right. Even when an LLM generates meaningless tokens, its internal state continues to evolve. Each new token triggers a fresh pass through the network, with attention over the KV cache, allowing the model to refine its contextual representation. The specific tokens may be gibberish, but the underlying computation can still reflect ongoing "thinking".
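
To make that concrete, here is a toy single-head sketch of what "a fresh pass with attention over the KV cache" means. It's plain numpy with random weights and made-up sizes, not anything from a real model: each new position, even a meaningless filler token, gets its own query and attends over everything cached so far, so the hidden state it produces keeps changing.

    import numpy as np

    rng = np.random.default_rng(0)
    d, vocab, max_len = 16, 8, 32               # toy sizes, arbitrary

    E = rng.standard_normal((vocab, d))         # token embedding table
    P = rng.standard_normal((max_len, d))       # positional encodings
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    K_cache, V_cache = [], []                   # the "KV cache"

    def step(token_id, pos):
        # One decoding step: embed the new token, attend over the whole cache.
        x = E[token_id] + P[pos]
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        K, V = np.stack(K_cache), np.stack(V_cache)
        scores = K @ (x @ Wq) / np.sqrt(d)      # query attends over all cached positions
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V                            # this position's hidden state

    prompt = [3, 5, 2]                          # a short "real" prompt
    for pos, t in enumerate(prompt):
        step(t, pos)

    for pos in range(len(prompt), len(prompt) + 4):
        h = step(0, pos)                        # the same filler token every time...
        print(pos, np.round(h[:3], 3))          # ...yet the hidden state keeps changing

The only point is that each position's hidden state is a fresh function of the whole cache; whether a trained model puts anything useful there is a separate question.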

Attention operates entirely on what is effectively hidden memory, in the sense that the KV cache and residual stream usually aren't exposed to the end user. An attention head can attend to one thing on one thinking token and to something entirely different on the next, and a later layer can combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler can create space for intermediate computation to happen.
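
As a standalone toy illustration of that (again numpy with random weights and purely hypothetical numbers): the same head, queried from two consecutive filler positions, spreads its attention over the earlier context differently, because the query changes with position even when the token doesn't.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_ctx = 16, 6                            # toy sizes, arbitrary

    P = rng.standard_normal((n_ctx + 2, d))     # positional encodings
    e_dot = rng.standard_normal(d)              # embedding of the filler token "."
    ctx = rng.standard_normal((n_ctx, d))       # stand-ins for 6 earlier context states
    Wq, Wk = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2))

    K = ctx @ Wk                                # keys for the 6 context positions

    def head_weights(pos):
        # Attention weights of one head when the query comes from "." at `pos`.
        q = (e_dot + P[pos]) @ Wq
        s = K @ q / np.sqrt(d)
        w = np.exp(s - s.max())
        return w / w.sum()

    # Same head, same filler token, consecutive positions: the attention mass
    # lands in different places, so a later layer sees two different mixtures
    # of the context that it can combine.
    print(np.round(head_weights(n_ctx), 2))
    print(np.round(head_weights(n_ctx + 1), 2))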

Wasn't there some study showing that just telling the LLM to write a bunch of periods first improves its responses?

  • There are several such papers; off the top of my head, one is https://news.ycombinator.com/item?id=44288049

    • Although, thinking about it a bit more, even when constrained to only output dots, there can still be some amount of information passing between tokens, namely through the hidden states. The attention block N layers deep computes attention scores off of the residual stream of the previous positions at that layer, so some information can be passed along that way.

      It's not very efficient, though, because for token i, layer N only sees layer N-1's outputs at tokens i, i-1, i-2, ..., so information gets passed along diagonally. If, handwavily, the embedding represents some "partial result", it can be passed diagonally from (N-1, i-1) to (N, i) so that the CoT at token i+1 can continue to work on it (the toy sketch below tries to make this concrete). So even though the total circuit depth is still bounded by the number of layers, this is clearly "more powerful" than naively going from layer 1 to n, because during the other steps you can work on something else.

      But it's still not as powerful as letting the result at the final layer n be fed back in as the next input, which effectively unrolls the depth. That maybe intuitively justifies the results in the paper (I think it also has some connection to communication complexity).
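
      One toy way to make the "unrolled depth" comparison concrete (a deliberate simplification: one unit of depth per layer, attention-only edges, made-up values for the layer count L and CoT length T) is to compute the longest dependency chain ending at the last layer of the last position, with and without the sampled token being fed back in as the next input.

          from functools import lru_cache

          L, T = 4, 6   # toy: 4 layers, 6 chain-of-thought positions (made-up numbers)

          def longest_path(feedback: bool) -> int:
              # Longest chain of sequential computation ending at (layer L, last position).
              # Nodes are (layer, pos); layer 0 is the input embedding.
              # Edges: (layer-1, j) -> (layer, i) for j <= i  (the "diagonal" flow above),
              # plus, if `feedback` is on, (L, i-1) -> (0, i) (sampled token fed back in).
              @lru_cache(maxsize=None)
              def depth(layer, pos):
                  if layer == 0:
                      # Forced dots: the input is free. Sampled tokens: it depends on
                      # the previous position's top layer.
                      return depth(L, pos - 1) + 1 if feedback and pos > 0 else 0
                  return 1 + max(depth(layer - 1, j) for j in range(pos + 1))
              return depth(L, T - 1)

          print(longest_path(feedback=False))  # stays at the number of layers: 4
          print(longest_path(feedback=True))   # grows with T: 29 here

      With forced filler the longest chain stays at L no matter how many extra tokens you add, while with the sampled token fed back in it grows roughly like T*(L+1), which matches the intuition that free-running CoT buys depth while filler mostly buys width.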

Eh. The embeddings themselves could act like hidden layer activations and encode some useful information.