Comment by rndphs

11 days ago

https://arxiv.org/pdf/1912.02292 "We show that a variety of modern deep learning tasks exhibit a 'double-descent' phenomenon where, as we increase model size, performance first gets worse and then gets better." That is the first sentence of the abstract, and the first graph shown in the paper backs it up.

Looking into it further, it seems that typical LLMs are in the first-descent regime anyway, so my original point is not very relevant to them. Also, the second descent doesn't always reach a lower loss than the first; that appears to depend on other factors as well.
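
For anyone who wants to poke at the effect themselves, here is a rough toy sketch. It is not the paper's setup (they train real networks); it is just minimum-norm least squares on random ReLU features, which is a common stand-in for "model size". The function name `double_descent_curve`, the data model (noisy linear target in 5 dimensions) and every parameter value are my own assumptions, picked only for illustration.

```python
import numpy as np

def double_descent_curve(widths, n_train=40, n_test=500, d=5,
                         noise=0.5, trials=20, seed=0):
    """Mean test MSE per feature width, averaged over random trials.

    Toy model (an assumption, not the paper's experiment): inputs are
    Gaussian, targets are a noisy linear function, and the "model" is a
    least-squares fit on `width` random ReLU features.
    """
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(widths))
    for _ in range(trials):
        w_true = rng.normal(size=d)
        x_train = rng.normal(size=(n_train, d))
        x_test = rng.normal(size=(n_test, d))
        y_train = x_train @ w_true + noise * rng.normal(size=n_train)
        y_test = x_test @ w_true  # noiseless targets for evaluation
        for i, p in enumerate(widths):
            # Random, untrained first layer; only the output weights are fit.
            W = rng.normal(size=(d, p)) / np.sqrt(d)
            phi_train = np.maximum(x_train @ W, 0.0)
            phi_test = np.maximum(x_test @ W, 0.0)
            # lstsq returns the minimum-norm solution when width > n_train.
            coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
            errors[i] += np.mean((phi_test @ coef - y_test) ** 2)
    return errors / trials

if __name__ == "__main__":
    widths = [5, 10, 20, 40, 80, 400]  # 40 == n_train, the interpolation threshold
    for p, err in zip(widths, double_descent_curve(widths)):
        print(f"width={p:4d}  test MSE={err:.3f}")
```

In a setup like this the test error usually spikes around width == n_train and comes back down for much larger widths. Whether the second descent ends up below the best small-width error depends on things like the noise level and the target, which fits the caveat above.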