Comment by zozbot234

13 hours ago

> We will run out of additional material to train on

This sounds a bit silly. More training will generally result in better modeling, even for a fixed amount of genuine original data. At current model sizes it's essentially impossible to overfit to the training data, so there's no reason why we should just "stop".

You'd be surprised how quickly the improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models, otoh, do keep benefiting for much longer, fwiw.
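
If you want to see what "levels off with epoch count" looks like concretely, here's a minimal toy sketch (PyTorch; the char-level LSTM, the repeated-sentence corpus, and every hyperparameter are made-up stand-ins, not taken from any paper in this thread): train for many epochs on one fixed corpus and log validation loss per epoch to see where the curve flattens.

```python
# Toy illustration only: a tiny char-level LSTM trained for many epochs on a
# fixed corpus, logging per-epoch validation loss so the plateau is visible.
# All names, sizes, and hyperparameters are arbitrary stand-ins.
import torch
import torch.nn as nn

text = "the quick brown fox jumps over the lazy dog. " * 200  # stand-in corpus
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]

BLOCK, BATCH = 32, 16

def get_batch(src):
    # Sample random (input, next-char target) windows from the corpus.
    ix = torch.randint(len(src) - BLOCK - 1, (BATCH,))
    x = torch.stack([src[i:i + BLOCK] for i in ix])
    y = torch.stack([src[i + 1:i + BLOCK + 1] for i in ix])
    return x, y

class CharLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)  # (batch, block, vocab) next-char logits

model = CharLM(len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
loss_fn = nn.CrossEntropyLoss()

# One epoch = roughly one full pass over the fixed training corpus.
steps_per_epoch = max(1, len(train_data) // (BLOCK * BATCH))
for epoch in range(40):
    model.train()
    for _ in range(steps_per_epoch):
        x, y = get_batch(train_data)
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        x, y = get_batch(val_data)
        logits = model(x)
        val = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    print(f"epoch {epoch + 1}: val loss {val.item():.3f}")  # watch where this stops improving
```

In a toy setup like this the validation loss flattens after only a few passes over the data; whether and where the same thing happens at LLM scale is exactly what's being debated in this thread.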

  • Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.

I'm just talking about text generated by human beings. You can keep retraining with more parameters on the same corpus:

https://proceedings.mlr.press/v235/villalobos24a.html

  • > I'm just talking about text generated by human beings.

    That in itself is a goalpost shift from

    > > We will run out of additional material to train on

    where it is implied that "additional material" === "all data, human + synthetic".

    ------

    There's still some headroom left in the synthetic-data playground, as the linked paper notes:

    https://proceedings.mlr.press/v235/villalobos24a.html ( https://openreview.net/pdf?id=ViZcgDQjyG )

    "On the other hand, training on synthetic data has shown much promise in domains where model outputs are relatively easy to verify, such as mathematics, programming, and games (Yang et al., 2023; Liu et al., 2023; Haluptzok et al., 2023)."

    With the caveat that translating this success outside of these domains is hit-or-miss:

    "What is less clear is whether the usefulness of synthetic data will generalize to domains where output verification is more challenging, such as natural language."

    The main bottleneck for this neck of the woods will be X := how many additional domains can be made easily verifiable. So long as (rate of growth of X) >> (rate at which training absorbs new data), the road can be extended for a while longer.
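
To make the "easily verifiable" point above concrete, here's a rough sketch of the generate-then-verify loop that synthetic-data pipelines in those domains rely on, with toy arithmetic standing in for the verifiable domain. The candidate generator below is a random stand-in for a model's sampler, and every name in it is hypothetical rather than taken from the linked paper:

```python
# Toy sketch of verification-gated synthetic data: propose candidate answers,
# keep only the ones an exact, cheap checker accepts. Everything here is a
# made-up illustration of the idea, not code from any cited work.
import random

def make_problem():
    a, b = random.randint(0, 99), random.randint(0, 99)
    return f"{a} + {b} = ?", a + b

def propose_answer(problem: str) -> int:
    """Stand-in for a model's sampler: usually right, sometimes off by one."""
    a, b = (int(tok) for tok in problem.split(" = ")[0].split(" + "))
    return a + b + random.choice([0, 0, 0, 1, -1])

def verify(problem: str, answer: int) -> bool:
    """The cheap, exact check that makes this domain 'easily verifiable'."""
    a, b = (int(tok) for tok in problem.split(" = ")[0].split(" + "))
    return answer == a + b

synthetic_dataset = []
for _ in range(1000):
    problem, _truth = make_problem()
    answer = propose_answer(problem)
    if verify(problem, answer):  # keep only candidates that pass the verifier
        synthetic_dataset.append((problem, answer))

print(f"kept {len(synthetic_dataset)} / 1000 verified synthetic examples")
```

The whole scheme hinges on verify() being cheap and exact, which is exactly why extending it beyond math, code, and games to open-ended natural language is the hard part.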