Comment by zozbot234

1 month ago

Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.

1 comment

zozbot234

Yup, as soon as data is the bottleneck and not compute, diffusion wins. Tested following the Chinchilla scaling strategy from 7M to 2.5B parameters.