Comment by mountainriver

2 months ago

Diffusion is about more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, than AR.

This is because it can edit its earlier output and doesn’t suffer from early token bias.

This is a super interesting claim - can you point to these benchmarks?

AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.

  • > AR in general is critical for learning the right distribution

    Could you please clarify that?

    • Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that -- it's a particular factorization of conditional probabilities which yields the same distribution you started with, and it's one you're able to replicate in your training data.

      AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing. Moreover, since the proposed limitation (not being able to think a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
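The factorization argument above can be sketched numerically. Here's a minimal toy example (the two-token vocabulary and probabilities are made up for illustration): sampling autoregressively via the chain rule, p(x1, x2) = p(x1) · p(x2 | x1), reproduces the joint distribution you started from.

```python
import random

# Hypothetical toy joint distribution over two binary tokens.
joint = {("a", "a"): 0.1, ("a", "b"): 0.3, ("b", "a"): 0.4, ("b", "b"): 0.2}

def p_first(x1):
    # Marginal p(x1) = sum over x2 of p(x1, x2)
    return sum(p for (a, _), p in joint.items() if a == x1)

def p_second_given_first(x2, x1):
    # Conditional p(x2 | x1) = p(x1, x2) / p(x1)
    return joint[(x1, x2)] / p_first(x1)

def sample_ar(rng):
    # Autoregressive sampling: draw x1 from its marginal,
    # then x2 from the conditional given x1.
    x1 = "a" if rng.random() < p_first("a") else "b"
    x2 = "a" if rng.random() < p_second_given_first("a", x1) else "b"
    return (x1, x2)

rng = random.Random(0)
n = 200_000
counts = {}
for _ in range(n):
    s = sample_ar(rng)
    counts[s] = counts.get(s, 0) + 1

for pair, target in sorted(joint.items()):
    print(pair, round(counts[pair] / n, 2), "target", target)
```

The empirical frequencies converge to the original joint probabilities, which is the sense in which AR "learns the right distribution": the left-to-right factorization is exact, not an approximation.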

A claim I believe (or want to), but can you point to any papers about this? I haven’t seen any papers at all, or demos, showing a revision step in text diffusion. I’d really like to use one though.