Comment by mountainriver

2 months ago

Diffusion is about more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, than AR.

This is because it can edit its earlier output and doesn’t suffer from early token bias.

This is a super interesting claim - can you point to these benchmarks?

AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.

  • > AR in general is critical for learning the right distribution

    Could you please clarify that?

    • Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that -- it's a particular factorization of conditional probabilities which yields the same distribution you started with, and it's one you're able to replicate in your training data.

      AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing. Moreover, since the proposed limitation (not being able to think a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
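The factorization argument above can be sketched numerically. Here's a minimal toy example (the two-token vocabulary and probabilities are made up for illustration): sampling autoregressively via the chain rule, p(x1, x2) = p(x1) · p(x2 | x1), reproduces the joint distribution you started from.

```python
import random

# Hypothetical toy joint distribution over two binary tokens.
joint = {("a", "a"): 0.1, ("a", "b"): 0.3, ("b", "a"): 0.4, ("b", "b"): 0.2}

def p_first(x1):
    # Marginal p(x1) = sum over x2 of p(x1, x2)
    return sum(p for (a, _), p in joint.items() if a == x1)

def p_second_given_first(x2, x1):
    # Conditional p(x2 | x1) = p(x1, x2) / p(x1)
    return joint[(x1, x2)] / p_first(x1)

def sample_ar(rng):
    # Autoregressive sampling: draw x1 from its marginal,
    # then x2 from the conditional given x1.
    x1 = "a" if rng.random() < p_first("a") else "b"
    x2 = "a" if rng.random() < p_second_given_first("a", x1) else "b"
    return (x1, x2)

rng = random.Random(0)
n = 200_000
counts = {}
for _ in range(n):
    s = sample_ar(rng)
    counts[s] = counts.get(s, 0) + 1

for pair, target in sorted(joint.items()):
    print(pair, round(counts[pair] / n, 2), "target", target)
```

The empirical frequencies converge to the original joint probabilities, which is the sense in which AR "learns the right distribution": the left-to-right factorization is exact, not an approximation.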

A claim I believe (or want to), but can you point to any papers about this? I haven’t seen any papers at all, or demos, showing a revision step in text diffusion. I’d really like to use one though.