Comment by kibwen
3 months ago
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.
> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words
Autoregressive LLMs don't do that either, actually. Sure, with one forward pass you only get one token at a time, but looking at what is happening in the latent space, there are clear signs of long-term planning and reasoning that go beyond just the next token.
So I don't think it's necessarily more or less similar to us than diffusion; we do say one word at a time sequentially, even if we have the bigger picture in mind.
To take a simple example, let’s say we ask an autoregressive model a yes/no factual question like “is 1+1=2?”. Then, we force the LLM to start with the wrong answer “No, ” and continue decoding.
An autoregressive model can’t edit the past. If it happens to sample the wrong first token (or we force it to in this case), there’s no going back. Of course there can be many more complicated lines of thinking as well where backtracking would be nice.
“Reasoning” LLMs tack this on with reasoning tokens. But the issue with this is that the LLM has to attend to every incorrect, irrelevant line of thinking which is at a minimum a waste and likely confusing.
As an analogy, in HN I don’t need to attend to every comment under a post in order to generate my next word. I probably just care about the current thread from my comment up to the OP. Of course a model could learn that relationship but that’s a huge waste of compute.
Text diffusion solves the whole problem entirely by allowing the model to simply revise the “no” to a “yes”. Very simple.
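A minimal sketch of the contrast described above, assuming a toy "model" that is just a lookup into a fixed answer (`TARGET`, `MASK`, and both function names are made up for illustration, and a real model has no such oracle). It only shows the structural difference: the left-to-right loop never touches tokens it has already emitted, while the refinement loop may rewrite any position, including the forced “No”.

```python
# Toy illustration only: a fixed lookup table stands in for a trained model.
# Nothing here reflects how a real LLM or text-diffusion model is implemented.
TARGET = ["Yes", ",", "1+1", "equals", "2", "."]  # hypothetical ideal answer
MASK = "[MASK]"

def autoregressive_decode(forced_prefix, length):
    """Append tokens left to right; tokens already emitted are never changed."""
    tokens = list(forced_prefix)
    while len(tokens) < length:
        tokens.append(TARGET[len(tokens)])  # only the next position is chosen
    return tokens

def diffusion_style_decode(initial, steps, per_step=2):
    """Each step may rewrite any position, including ones filled in earlier."""
    tokens = list(initial)
    for _ in range(steps):
        # Find positions the toy "model" disagrees with, then revise a few.
        wrong = [i for i, tok in enumerate(tokens) if tok != TARGET[i]]
        for i in wrong[:per_step]:
            tokens[i] = TARGET[i]
    return tokens

ar_output = autoregressive_decode(["No"], len(TARGET))
diff_output = diffusion_style_decode(["No"] + [MASK] * 5, steps=3)
print(ar_output)    # the forced "No" is still the first token
print(diff_output)  # the forced "No" has been revised to "Yes"
```

In the left-to-right case the only way to recover is to emit more tokens after the mistake, which is the role the reasoning tokens mentioned above play.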
That is precisely what autoregressive means. Perhaps you meant to write that modern LLMs are not strictly autoregressive?
I think they are distinguishing the mechanical process of generation from the way the idea exists. It’s the same as how a person can literally only speak one word at a time but the ideas might be nonlinear.
If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it anything resembling efficiently, or that a different architecture / algorithm wouldn't produce a better result.
I’m not arguing about efficiency, though. I’m simply saying that next-token predictors cannot be thought of as just thinking about the next token with no long-term plan.
It also doesn’t mean they’re doing it inefficiently.
You're right that there is long-term planning going on, but that doesn't contradict the fact that an autoregressive LLM does, in fact, literally generate words one at a time based on previously spoken words. Planning and action are different things.
There is some long-term planning going on, but bad luck when sampling the next token can take the process off the rails, so it's not just an implementation detail.
Here's a blog post I liked that explains a connection: https://sander.ai/2024/09/02/spectral-autoregression.html
They call diffusion a form of "spectral autoregression", because it tends to first predict lower frequency features, and later predict higher frequency features.
I will very often write a message on Slack, only to then edit it 5 times… Now I always feel like a diffusion model when I do that.
Coding feels like that to me as well.
You 100% do pronounce or write words one at a time sequentially.
But before starting your sentence, you internally formulate the gist of the sentence you're going to say.
Which is exactly what happens in an LLM's latent space too, before it starts outputting the first token.
I'm curious what makes you so confident on this? I confess I expect that people are often far more cognizant of the last thing that they want to say when they start?
I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their mind as they compose their thoughts into sentences.
Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to diffusion than to token-at-a-time.
I think there is a wide range of ways to "turn something in the head into words", and sometimes you use the "this is the final point, work towards it" approach and sometimes you use the "not sure what will happen, let's just start talking and go wherever" approach. Different approaches have different tradeoffs, and of course different people have different defaults.
I can confess to not always knowing where I'll end up when I start talking. Similarly, I don't always open my mouth just to start talking; sometimes I do have a goal and a conclusion in mind.
They're speaking literally. When talking to someone (or writing), you ultimately say the words in order (edits or corrections notwithstanding). If you look at the gifs of how the text is generated - I don't know of anyone that has ever written like that. Literally writing disconnected individual words of the actual draft ("during," "and," "the") in the middle of a sentence and then coming back and filling in the rest. Even speaking like that would be incredibly difficult.
Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.
It's just too far of an analogy, it starts in the familiar SWE tarpit of human brain = lim(n matmuls) as n => infinity.
Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?
Error beyond the tarpit is, these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.
Taking a step back to the start, we're wondering:
Do LLMs plan for token N + X, while purely working to output token N?
TL;DR: yes.
via https://www.anthropic.com/research/tracing-thoughts-language....
Clear quick example they have is, ask it to write a poem, get state at end of line 1, scramble the feature that looks ahead to end of line 2's rhyme.
> far more cognizant of the last thing that the they want to say when they start
This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.
If a certain set of nodes are strong contributors to generate the concluding sentence, and they remain strong throughout all generated tokens, who's to say if those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?
(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)
Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.
It must be the case that some smart people have studied how we think, right?
The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.
Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).
At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.
People don't come up with things their brain does.
Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.
So really there is no argument to be made, because we still don't mechanistically understand how the brain works.
Like most people, I jump back and forth when I speak, disclaiming, correcting, and appending to previous utterances. I do this even more when I write, eradicating entire sentences and even the ideas they contain, within paragraphs where, by the time they were finished, the sentence seemed unnecessary or inconsistent.
I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."
For most serious texts I start with a tree outline, before I engage my literary skills.
>You 100% do pronounce or write words one at a time sequentially.
It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then fill in around them to form sentences. Often because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?
It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.
All of that can still be seen as a linear sequence of actions from the perspective of human I/O with the environment.
What happens in the black box of the human mind to determine the next word to write or say is made irrelevant at this level of abstraction: regardless of how it happens, it still results in a linear sequence of actions as observed by the environment.
Are you able to pronounce multiple words in superposition at the same time? Are you able to write multiple words in superposition? Can you read the following sentence: "HWeolrllod!"
Clearly communication is sequential.
LLMs are not more sequential than your vocal cords or your handwriting. They also plan ahead before writing.
(Just to expand on that, it's true not just for the first token. There's a lot of computation, including potentially planning ahead, before each token is output.)
That's why saying "it's just predicting the next word" is a misguided take.
Interpretability research has found that autoregressive LLMs also plan ahead what they are going to say.
The March 2025 blog post by Anthropic titled "Tracing the thoughts of a large language model"[1] is a great introduction to this research, showing how their language model activates features representing concepts that will eventually get connected at some later point as the output tokens are produced.
The associated paper[2] goes into a lot more detail, and includes interactive features that help illustrate how the model "thinks" ahead of time.
[1] https://www.anthropic.com/research/tracing-thoughts-language...
[2] https://transformer-circuits.pub/2025/attribution-graphs/bio...
This seems likely just from the simple fact that they can reliably generate contextually correct sentences in e.g. German Imperfekt.
And, to pick an example from the research, being able to generate output that rhymes. In fact, it's hard to see how you would produce anything that would be considered coherent text without some degree of planning ahead at some level of abstraction. If it was truly one token at a time without any regard for what comes next it would constantly 'paint itself into a corner' and be forced to produce nonsense (which, it seems, does still happen sometimes, but without any planning it would occur constantly).
I don't think you're wrong but I don't think your logic holds up here. If you have a literal translation like:
I have a hot dog _____
The word in the blank is not necessarily determined when the sentence is started. Several verbs fit at the end and the LLM doesn't need to know which it's going to pick when it starts. Each word narrows down the possibilities:
I - trillions
have - billions
a - millions
hot - thousands
dog - dozens
_____ - could be eaten, cooked, thrown, whatever.
If it chooses cooked at this point that doesn't necessarily mean that the LLM was going to do that when it chose "I" or "have"
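A rough sketch of that narrowing, assuming a made-up seven-sentence corpus in place of the "trillions" of real continuations (`SENTENCES` and `remaining_after_each_word` are hypothetical names, and the counts are rhetorical rather than measured). It counts how many candidate sentences are still compatible after each word of the prefix, and shows that several endings remain open once the prefix is complete.

```python
# Hypothetical mini-corpus standing in for every sentence a model could produce.
SENTENCES = [
    "I have a hot dog eaten",
    "I have a hot dog cooked",
    "I have a hot dog thrown",
    "I have a hot day endured",
    "I have a cold drink ordered",
    "I saw a hot dog",
    "You have a hot dog eaten",
]

def remaining_after_each_word(prefix_words, corpus):
    """For each prefix length, count the sentences that still match the prefix."""
    counts = []
    for n in range(1, len(prefix_words) + 1):
        prefix = prefix_words[:n]
        counts.append(sum(1 for s in corpus if s.split()[:n] == prefix))
    return counts

prefix = "I have a hot dog".split()
counts = remaining_after_each_word(prefix, SENTENCES)

# Endings still possible once the whole prefix has been emitted.
endings = {s.split()[-1] for s in SENTENCES if s.split()[:len(prefix)] == prefix}

print(counts)           # [6, 5, 5, 4, 3]
print(sorted(endings))  # ['cooked', 'eaten', 'thrown']
```

The count never increases from one word to the next, yet more than one ending is still available at the blank, so picking "cooked" there does not require it to have been decided at "I".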
It's actually true on many levels, if you think about what is needed for generating syntactically and grammatically correct sentences, coherent text and working code.
That's why I'm very excited by Gemini diffusion[1].
- [1] https://deepmind.google/models/gemini-diffusion/
The fact that you’re cognitively aware is evidence that this is nowhere near diffusion. More like rumination or thinking tokens, if we absolutely had to find a present-day LLM metaphor.
It feels like a mix of both to me, diffusion "chunks" being generated in sequence. As I write this comment, I'm deciding on the next word while also shaping the next sentence, like turning a fuzzy idea into a clear sequence.
Maybe it's two different modes of thinking. I can have thoughts that coalesce from the ether, but also sometimes string a thought together linearly. Brains might be able to do both.
I feel completely the opposite way.
When you speak or do anything, you focus on what you’re going to do next. Your next action. And at that moment you are relying on your recent memory, and things you have put in place while doing the overall activity (context).
In fact what’s actually missing from AI currently is simultaneous collaboration, like a group of people interacting — it is very 1 on 1 for now. Like human conversations.
Diffusion is like looking at a cloud and trying to find a pattern.
LLMs are notoriously bad at reflecting on how they work and I feel like humans are probably in the same boat
That is not contrary to the token-at-a-time approach.
> Speaking for myself, I don't generate words one at a time based on previously spoken words
This is a common but fundamentally weird assumption people have about neurology, where they think that what they consciously perceive has some bearing on what's actually happening at the operational or physical level.