Comment by koningrobot

4 months ago

It goes further back than that. In 2014, Li Yao et al. (https://arxiv.org/abs/1409.0585) drew an equivalence between autoregressive (next-token prediction, roughly) generative models and generative stochastic networks (denoising autoencoders, the predecessor to diffusion models). They argued that the parallel sampling style correctly approximates sequential sampling.

In my own work circa 2016 I used this approach in Counterpoint by Convolution (https://arxiv.org/abs/1903.07227), where we in turn argued that despite being an approximation, it leads to better results. Sadly, because it was dressed up as an application paper, we weren't able to draw enough attention to get those sweet diffusion citations.