Comment by betula_ai

1 day ago

Thank you for this informative and thoughtful post. An interesting twist on the error accumulation that grows as autoregressive models generate more output is the recent success of language diffusion models, which predict multiple tokens simultaneously. They apply a remasking strategy at every step of the reverse process, masking low-confidence tokens. Regardless, your observations perhaps still apply. https://arxiv.org/pdf/2502.09992

Thanks for bringing this up! As far as I understand it, current text diffusion models are limited to fairly short context windows. The idea of a text diffusion model continuously updating and revising a million-token-long chain-of-thought is pretty mind-boggling. I agree that these non-autoregressive models could potentially behave in completely different ways.

That said, I'm pretty sure we're a long way from building equally-competent diffusion-based base models, let alone reasoning models.
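For readers curious what low-confidence remasking looks like mechanically, here's a toy sketch (my own illustrative code, not the paper's implementation; the names and the `MASK` sentinel are made up): at each denoising step, keep only the model's most confident token predictions and re-mask the rest so the next step can revise them.

```python
import numpy as np

MASK = -1  # illustrative sentinel id for a masked position


def remask_low_confidence(token_ids, confidences, keep_ratio):
    """Toy remasking step: keep the highest-confidence predictions,
    re-mask the rest for the next denoising step to revise."""
    ids = np.asarray(token_ids).copy()
    conf = np.asarray(confidences, dtype=float)
    n_keep = int(round(keep_ratio * len(ids)))
    # indices of the n_keep most confident predictions
    keep = np.argsort(conf)[::-1][:n_keep]
    out = np.full_like(ids, MASK)
    out[keep] = ids[keep]
    return out


# Keep the top half: positions 0 and 2 are most confident here.
print(remask_low_confidence([10, 11, 12, 13], [0.9, 0.2, 0.8, 0.1], 0.5))
```

In a real model the confidences would come from the denoiser's output distribution, and `keep_ratio` would follow a schedule that unmasks more tokens as the reverse process progresses.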

If anyone's interested in this topic, here are some more foundational papers to take a look at:

- Simple and Effective Masked Diffusion Language Models [2024] (https://arxiv.org/abs/2406.07524)

- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [2023] (https://arxiv.org/abs/2310.16834)

- Diffusion-LM Improves Controllable Text Generation [2022] (https://arxiv.org/abs/2205.14217)