Comment by mdp2021

2 months ago

Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (a tree of idea, points, concepts, words) that is refined iteratively, and that should go in the direction of "proper" quality.

Another intuition is simply that whenever the causal relationships in your training data are sequential, you have a lower probability of getting the correct token at a given position, because you have less of the causal information leading up to that position than you would with AR. So during training you almost certainly end up with a worse model (think of the words in a function of source code, even if the functions themselves are unsorted and thus form a tree at the high level). Imagine you somehow already have N tokens in a sequence: is it easier to predict token N+1 next, or token N+15? I do like the performance tradeoff for some use cases, though, and I hope we see more models soon. For image tokens my argument does not hold, because causality is not as clear as it is for text, math, code, or time series.
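A toy way to see the N+1 versus N+15 point, as a rough sketch and not a model of any real dLLM: assume the "text" is a 3-token first-order Markov chain with a made-up sticky transition matrix `P`, and that the predictor knows only token N. The helper names (`k_step_matrix`, `best_guess_accuracy`) are hypothetical. A real diffusion model also conditions on other unmasked positions and refines iteratively, so this only illustrates the information-loss intuition.

```python
# Toy illustration: how well can you guess the token k steps ahead
# if you only know the current token? (Assumed 3-token Markov chain.)

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def k_step_matrix(p, k):
    """Return p raised to the k-th power: the k-step transition probabilities."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, p)
    return result

def best_guess_accuracy(p, k):
    """Accuracy of always guessing the most likely token k steps ahead,
    averaged uniformly over the starting token."""
    pk = k_step_matrix(p, k)
    return sum(max(row) for row in pk) / len(pk)

# Made-up "sticky" chain: each token repeats with probability 0.8.
P = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

if __name__ == "__main__":
    for k in (1, 5, 15):
        print(f"k={k:2d}  best-guess accuracy={best_guess_accuracy(P, k):.4f}")
```

In this toy chain the accuracy decays toward chance (1/3) as k grows, which is the sense in which position N+15 is harder than N+1 when the dependencies are sequential.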

My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning will be encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through. This won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.