Comment by Mathnerd314
6 days ago
OK, so this intuition is actually a bit hard to unpack; I picked it up from bits and pieces. The main source is this post: https://www.fast.ai/posts/2023-09-04-learning-jumps/. Essentially, a single pass over the training data is enough for an LLM to significantly "learn" the material. And if you read the LLM training papers, for the largest models they generally say explicitly that they did only one pass over the training corpus, and sometimes not even the full corpus, maybe 80% of it.

The other relevant evidence is the loss curves. Models like Llama 3 are not trained until the loss on the training data is minimized, the way typical ML models are; instead the labs rely on approximate estimates of FLOPs / tokens vs. performance on benchmarks. But it is pretty much guaranteed that if you kept training on the same data, the fit would keep improving; one pass is by no means enough to learn all of the patterns. So from a compression standpoint, the paper I linked previously says an LLM is a great compressor, but it's not even fully tuned, hence "not trained to saturation".
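To make the FLOPs / tokens point concrete, here is a rough back-of-envelope sketch (not any lab's actual recipe): it uses the standard C ≈ 6·N·D FLOPs approximation and the Chinchilla-style "~20 tokens per parameter" rule of thumb, with an 8B-parameter model as an arbitrary example.

```python
# Back-of-envelope compute-optimal sizing. Rough approximations only:
# training FLOPs C ~ 6 * N * D, compute-optimal tokens D ~ 20 * N.

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training-token count for n_params."""
    return 20.0 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

n = 8e9                               # e.g. an 8B-parameter model
d_opt = chinchilla_optimal_tokens(n)  # ~1.6e11 tokens (~160B)
print(f"compute-optimal tokens: {d_opt:.2e}")
print(f"FLOPs at that budget:   {training_flops(n, d_opt):.2e}")
```

The point is that the stopping rule is a token / FLOPs budget chosen in advance, not "train until the training loss stops going down".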
Now, as for how fine-tuning affects model performance, it is pretty simple: it improves fit on the fine-tuning data and decreases fit on the original training corpus. Beyond that, yeah, it's hard to say whether fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, fine-tuning further will not help; but if you are getting results, fine-tuning will make them more consistent.
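If you want to see this effect on your own model, here is a minimal sketch of the measurement (the model name and text samples below are placeholders, and I'm assuming the Hugging Face transformers API):

```python
# Hypothetical before/after check of "fine-tuning improves fit on the
# fine-tune data and hurts fit on the original corpus": measure average
# next-token cross-entropy on held-out slices of both corpora.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_loss(model, tokenizer, texts):
    """Average cross-entropy loss over a list of raw text samples."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return sum(losses) / len(losses)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

base_sample = ["held-out text from the original pre-training domain"]
finetune_sample = ["held-out text from the fine-tuning dataset"]

print("base-corpus loss:", mean_loss(model, tokenizer, base_sample))
print("fine-tune loss:  ", mean_loss(model, tokenizer, finetune_sample))
# Re-run both after fine-tuning: the claim predicts the second number
# drops while the first one rises.
```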
Always appreciated the work of Jeremy Howard, and I had a lot of fun using the fast.ai framework. My experience is similar to your description: when training for 2, 3, or more epochs, I felt that overfitting started to emerge. (And I was CERTAINLY not training models anywhere near the size of modern LLMs.) I suppose in this case by "saturation" you meant training "marginally before exhibiting over-fitting", something akin to the elbow method in clustering? I'll have to chew on your description of overfitting results for a while. It jibes with mine, but in a way that really makes me question my own. Thanks for the thought-provoking response!
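To spell out what I mean by the elbow, something like this sketch: train epoch by epoch and stop the moment validation loss turns upward, so the previous epoch is the elbow (train_one_epoch and validation_loss are stand-ins for whatever your framework provides; I believe fastai's EarlyStoppingCallback does essentially this):

```python
# Minimal "stop marginally before over-fitting" loop: halt as soon as
# validation loss stops improving. train_one_epoch and validation_loss
# are hypothetical stand-ins for your framework's training/eval steps.

def train_until_elbow(model, train_one_epoch, validation_loss, max_epochs=10):
    best_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        val = validation_loss(model)
        print(f"epoch {epoch}: val loss {val:.4f}")
        if val >= best_loss:
            # Validation loss turned upward: the previous epoch was the elbow.
            print(f"stopping; epoch {epoch - 1} was the elbow")
            break
        best_loss = val
    return model
```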