Comment by _0ffh
15 hours ago
You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT of data). Diffusion language models, otoh, keep benefiting from extra epochs for much longer, fwiw.
Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.
Yup: as soon as data, not compute, is the bottleneck, diffusion wins. We tested this following the Chinchilla scaling strategy, from 7M up to 2.5B parameters.
https://arxiv.org/abs/2507.15857
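
A rough back-of-the-envelope sketch (my own illustration, not from the paper): assuming the usual Chinchilla rule of thumb of roughly 20 training tokens per parameter and a hypothetical fixed pool of unique tokens, the implied number of epochs grows linearly with model size. That's the data-constrained regime being discussed, where unique data runs out and you have to repeat it.

    # Sketch only: Chinchilla-style token budget vs. a fixed pool of unique data.
    # The 20 tokens/parameter ratio and the 10B-token dataset are assumptions
    # for illustration, not numbers taken from the linked paper.

    def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
        """Compute-optimal training-token budget for a model of n_params parameters."""
        return n_params * tokens_per_param

    def epochs_needed(n_params: float, unique_tokens: float) -> float:
        """Passes over the unique data implied by the Chinchilla token budget."""
        return chinchilla_tokens(n_params) / unique_tokens

    if __name__ == "__main__":
        unique_tokens = 10e9  # hypothetical fixed dataset of 10B unique tokens
        for n_params in (7e6, 100e6, 1e9, 2.5e9):
            print(f"{n_params / 1e6:>7.0f}M params -> "
                  f"{chinchilla_tokens(n_params) / 1e9:6.1f}B tokens, "
                  f"{epochs_needed(n_params, unique_tokens):5.2f} epochs")

At the small end the budget is a fraction of an epoch; at 2.5B parameters the same pool has to be repeated several times, which is where the autoregressive-vs-diffusion comparison above becomes interesting.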