Comment by zwaps
1 day ago
Scale changes the performance of LLMs.
Sometimes, we go so far as to say there is "emergence" of qualitative differences. But really, this is not necessary (and not proven to actually occur).
What is true is that the performance of LLMs at OOD tasks changes with scale.
So no, it's not the same as solving a math problem.
If you scale the LLM, you have to scale the tasks.
Of course performance improves on the same tasks.
The researchers behind the submitted work chose a certain model size and problems of a certain size, controlling everything else. There is no reason to believe that their results won't generalize to larger or smaller models.
Of course, not if the input problems are held constant! But that is a strawman.
> What is true is that the performance of LLMs at OOD tasks changes with scale.
If scaling alone guaranteed strong OOD generalization, we'd expect the largest models to consistently top OOD benchmarks, but this isn't the case. In practice, scaling primarily increases a model's capacity to represent and exploit statistical relationships present in the training distribution. This reliably boosts in-distribution performance but yields limited gains on tasks that are distributionally distant from the training data, especially if the underlying dataset is unchanged. That's why trillion-parameter models trained on the same corpus may excel at tasks similar to those seen in training, but won't necessarily show proportional improvements on genuinely novel OOD tasks.
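The ID-vs-OOD asymmetry can be illustrated with a toy curve-fitting analogy (my own sketch, not the submitted paper's setup): raising a polynomial model's capacity steadily improves fit inside the training range, while error outside that range does not improve in proportion.

```python
import numpy as np

# Toy "training distribution": noisy sin(x) sampled on [0, 3].
rng = np.random.default_rng(0)
x_train = np.linspace(0, 3, 200)
y_train = np.sin(x_train) + 0.05 * rng.normal(size=x_train.size)

x_id = np.linspace(0, 3, 100)   # in-distribution test points
x_ood = np.linspace(4, 6, 100)  # out-of-distribution test points

def rmse_by_capacity(degree):
    """Fit a polynomial of the given degree (a stand-in for model
    capacity) and return (in-distribution, out-of-distribution) RMSE."""
    model = np.poly1d(np.polyfit(x_train, y_train, degree))
    id_err = np.sqrt(np.mean((model(x_id) - np.sin(x_id)) ** 2))
    ood_err = np.sqrt(np.mean((model(x_ood) - np.sin(x_ood)) ** 2))
    return id_err, ood_err

for degree in (1, 3, 9):
    id_err, ood_err = rmse_by_capacity(degree)
    print(f"degree {degree}: ID RMSE {id_err:.3f}, OOD RMSE {ood_err:.3f}")
```

Scaling up the degree drives in-distribution error down, but extrapolation error on [4, 6] stays large (and can worsen), mirroring the point above: more capacity exploits the training distribution rather than conferring generalization beyond it.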