Comment by Nevermark

6 days ago

It would be very interesting to fine-tune a model for a narrow task while tracking its performance on every original training sample, relative to the pre-tuning baseline.

I expect it would greatly help characterize what was lost, at the expense of a great deal of extra computation. But with enough experiments it might shed some more general light.
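Roughly what I have in mind, as a toy sketch in PyTorch (the model, data, step counts, and regression threshold are arbitrary placeholders, nothing like a real LLM setup): train on one task, snapshot the per-sample loss on its training examples as the baseline, then watch those losses drift while fine-tuning on a narrow, different task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: a tiny model, an "original" task, and a narrow fine-tuning task.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss(reduction="none")

pretrain_x = torch.randn(512, 16)
pretrain_y = pretrain_x.sum(dim=1, keepdim=True)                # original task
finetune_x = torch.randn(64, 16)
finetune_y = (finetune_x[:, :4] ** 2).sum(dim=1, keepdim=True)  # narrow new task

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

@torch.no_grad()
def per_sample_loss():
    """Loss on every original training sample, so forgetting is tracked per sample."""
    return loss_fn(model(pretrain_x), pretrain_y).squeeze(1)

# "Pre-train" on the original task so there is something to forget.
for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(pretrain_x), pretrain_y).mean().backward()
    opt.step()

baseline = per_sample_loss()  # pre-tuning baseline, one value per original sample

# Fine-tune on the narrow task, logging how the original samples degrade.
for step in range(1, 1001):
    opt.zero_grad()
    loss_fn(model(finetune_x), finetune_y).mean().backward()
    opt.step()
    if step % 200 == 0:
        current = per_sample_loss()
        regressed = (current > baseline + 1.0).float().mean().item()  # arbitrary threshold
        print(f"step {step}: original-task loss {current.mean().item():.3f} "
              f"(baseline {baseline.mean().item():.3f}), "
              f"samples regressed: {regressed:.0%}")
```

The per-sample losses, rather than an aggregate, are the point here: the aggregate hides which kinds of original samples get overwritten first.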

I suspect the smaller the tuning dataset, the faster and more severe the overwriting will be, since the new optimization surface will be so much simpler to navigate than that of the much bigger original dataset.

Then a question might be: what percentage of the original training data, randomly retained and mixed back into the tuning set, would slow the general degradation?
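A crude way to run that sweep (again just a sketch, with placeholder tensors and fractions): fill some proportion of each fine-tuning batch with examples sampled at random from the original training data, and compare the degradation curves as that proportion shrinks toward zero.

```python
import torch

torch.manual_seed(0)

def mixed_batch(finetune_x, finetune_y, pretrain_x, pretrain_y,
                batch_size=32, replay_frac=0.1):
    """Fine-tuning batch with `replay_frac` of its slots filled by examples
    sampled uniformly at random from the original training data."""
    n_old = int(round(batch_size * replay_frac))
    n_new = batch_size - n_old
    new_idx = torch.randint(len(finetune_x), (n_new,))
    old_idx = torch.randint(len(pretrain_x), (n_old,))
    x = torch.cat([finetune_x[new_idx], pretrain_x[old_idx]])
    y = torch.cat([finetune_y[new_idx], pretrain_y[old_idx]])
    return x, y

# Toy tensors standing in for the two datasets.
pretrain_x, pretrain_y = torch.randn(512, 16), torch.randn(512, 1)
finetune_x, finetune_y = torch.randn(64, 16), torch.randn(64, 1)

for replay_frac in (0.0, 0.01, 0.05, 0.1, 0.25):
    xb, yb = mixed_batch(finetune_x, finetune_y, pretrain_x, pretrain_y,
                         batch_size=32, replay_frac=replay_frac)
    # Each setting would drive a separate fine-tuning run of the loop above,
    # logging the per-sample losses on the original data as it goes.
    print(f"replay_frac={replay_frac}: mixed batch shape {tuple(xb.shape)}")
```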