Comment by zozbot234
5 months ago
Okay, let's think this through step by step. Isn't 'reflection thinking' a pretty well-known technique in prompt engineering by now? So this model was supposed to be so much better... why, exactly? It makes very little sense to me. Is it just about separating the "reflections/chain of thought" from the "final output" via specific tags?
Even though this was a scam, it's somewhat plausible. You finetune on synthetic data with lots of common reasoning mistakes followed by self-correction. You also finetune on synthetic data without reasoning mistakes, where the "reflection" simply says that everything is fine. The model then learns to recognize output with subtle mistakes/hallucinations, because it has been explicitly trained to do that.
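To make that concrete, a synthetic training example might be built roughly like this (the tag names and structure are just my guess at the format, not the actual data):

    # Hypothetical sketch of how such synthetic finetuning data could be built.
    # Tag names and wording are guesses, not the real Reflection training format.

    def flawed_example(question, flawed_steps, correction, answer):
        # Reasoning that contains a mistake, followed by an explicit self-correction.
        thinking = "\n".join(flawed_steps)
        return (f"{question}\n<|thinking|>\n{thinking}\n"
                f"<|reflection|>That last step is wrong: {correction}<|/reflection|>\n"
                f"<|/thinking|>\n<|output|>{answer}<|/output|>")

    def clean_example(question, steps, answer):
        # Correct reasoning where the reflection simply confirms the steps.
        thinking = "\n".join(steps)
        return (f"{question}\n<|thinking|>\n{thinking}\n"
                f"<|reflection|>Checked each step, nothing to fix.<|/reflection|>\n"
                f"<|/thinking|>\n<|output|>{answer}<|/output|>")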
But wouldn't the model then also learn to make reasoning mistakes in the first place, mistakes that in some cases could have been avoided by not training it on incorrect reasoning at all?
Of course, if all mistakes are corrected before the final output tokens, this is fine, but I could see this method introducing new errors altogether.
Supposedly it was not just prompted to use reflection, but fine-tuned on synthetic data demonstrating how to use the <|thinking|> tokens to reason, what self-correction looks like, etc.
The problem with LLMs is that they struggle to generalize out of distribution. By training the model on a sequence of semantically tagged steps, you let it stay within the training distribution for a wider range of prompts.
I don't think it is 100% a scam, in the sense that the technique does improve performance somewhat, since a lot of the benefit can be reproduced with just a system prompt. But the wild performance claims are probably completely fabricated.
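For reference, you can get most of the behavior with something along these lines (wording paraphrased from memory, not the actual prompt they shipped):

    # Rough sketch of a reflection-style system prompt in the usual
    # chat-message format; the wording here is illustrative only.
    messages = [
        {"role": "system", "content": (
            "Reason through the problem step by step inside <thinking> tags. "
            "If you notice a mistake in your reasoning, correct it inside "
            "<reflection> tags. Give only your final answer inside <output> tags."
        )},
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ]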