Comment by baegi
5 months ago
But wouldn't the model then also learn to make reasoning mistakes in the first place, when in some cases those mistakes could have been avoided by not training it on incorrect reasoning?
Of course, if all mistakes are corrected before the final output tokens, this is fine, but I could see this method introducing new errors altogether.