zozbot234 (5 hours ago): Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.

_0ffh (1 hour ago):
Yup, as soon as data is the bottleneck and not compute, diffusion wins. Tested following the Chinchilla scaling strategy from 7M to 2.5B parameters.
https://arxiv.org/abs/2507.15857
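For context, a minimal sketch of what a Chinchilla-style compute-optimal budget looks like over that 7M to 2.5B range, assuming the common ~20 tokens per parameter rule of thumb from Hoffmann et al. 2022 (the paper's exact fit may differ, and the intermediate model sizes below are illustrative, not taken from it):

    # Rough Chinchilla-style compute-optimal budgets.
    # Assumption: ~20 tokens per parameter (Hoffmann et al. 2022 rule of thumb);
    # the intermediate model sizes are illustrative only.
    TOKENS_PER_PARAM = 20

    model_sizes = [7e6, 70e6, 250e6, 1e9, 2.5e9]  # 7M .. 2.5B parameters

    for n_params in model_sizes:
        optimal_tokens = TOKENS_PER_PARAM * n_params
        flops = 6 * n_params * optimal_tokens  # standard C ~ 6*N*D estimate
        print(f"{n_params/1e6:7.0f}M params -> {optimal_tokens/1e9:6.2f}B tokens, "
              f"~{flops:.1e} FLOPs")

The comment's point is about the regime where you can no longer grow the token budget along with the parameter count: once data, not compute, is the limiting factor, the paper finds diffusion training comes out ahead.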