Comment by dragonwriter

2 years ago

It may mean that the evaluations useful range of distinguishing inprovements is limited. If its a 0-100 score on defined sets of tasks that were set because they were hard enough to distinguish quality in models a while back, the rapid rate of improvement may mean that they are no longer useful in distinguishing quality of current models even aside from the problem that it is increasingly hard to stop the actual test tasks from being reflected in training data in some form.