Comment by baegi
5 months ago
But wouldn't the model then also learn to make reasoning mistakes in the first place, when in some cases those mistakes could have been avoided by not training it on incorrect reasoning?
Of course, if all mistakes are corrected before the final output tokens, this is fine, but I could see this method introducing new errors altogether.