
Comment by btown

3 months ago

> far more cognizant of the last thing that they want to say when they start

This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual answer tokens), or even by intermediate layers of a model that isn't using explicit reasoning.
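
As a rough illustration of that first mechanism, here's a minimal two-pass sketch, assuming a Hugging Face causal LM (the model name and prompt wording are placeholders, not anyone's actual setup): generate a compressed "conclusion" first, then condition the full answer on it.

```python
# Two-pass "plan then write" sketch; model and prompts are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Why does ice float on water?"

# Pass 1: generate a short representation of the desired conclusion.
plan_prompt = f"Question: {question}\nOne-sentence conclusion:"
plan_ids = tok(plan_prompt, return_tensors="pt").input_ids
plan_out = model.generate(plan_ids, max_new_tokens=30, do_sample=False)
conclusion = tok.decode(plan_out[0][plan_ids.shape[1]:], skip_special_tokens=True)

# Pass 2: condition the full answer on that conclusion, so the model
# "knows where it is going" before it starts writing.
answer_prompt = (
    f"Question: {question}\n"
    f"Conclusion to build toward: {conclusion}\n"
    f"Full answer:"
)
answer_ids = tok(answer_prompt, return_tensors="pt").input_ids
answer_out = model.generate(answer_ids, max_new_tokens=120, do_sample=False)
print(tok.decode(answer_out[0][answer_ids.shape[1]:], skip_special_tokens=True))
```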

If a certain set of nodes are strong contributors to generating the concluding sentence, and they remain strong throughout all generated tokens, who's to say those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?
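
One hedged way to poke at that question is a logit-lens-style probe: project intermediate-layer hidden states through the output head and see whether conclusion-relevant tokens already rank highly before any answer tokens exist. The sketch below assumes a GPT-2-style Hugging Face model purely for concreteness, and applying the final layer norm to intermediate states is a standard logit-lens approximation rather than an exact decomposition.

```python
# Logit-lens-style probe: do early layers already "point at" the conclusion?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: Why does ice float on water?\nA:"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, dim].
for layer, h in enumerate(out.hidden_states):
    last = h[0, -1]  # hidden state at the final prompt position, before any answer tokens
    # Project through the final norm and output head (logit-lens approximation).
    logits = model.lm_head(model.transformer.ln_f(last))
    top_tokens = tok.decode(logits.topk(5).indices)
    print(f"layer {layer:2d}: {top_tokens}")
```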

(This is also in the context of the LLM being able to use long-range attention, so it doesn't need to encode in full detail what it "wants to say", only the parts of the original input text it is focusing on over time.)

Of course, this doesn't mean it's the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that lets us reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.