Comment by xianshou
6 months ago
Crazy how simple the technique is if this holds up. Just <think> and <reflection> plus synthetic data, used to finetune Llama 3.1 70B.
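For anyone who wants to poke at outputs in this style, here is a rough sketch of splitting a tagged response into its parts. The <think> and <reflection> tag names are taken from the description above; the <output> tag, the function name, and the sample string are my own assumptions for illustration and are not from the model's actual release.

```python
import re

# <think> and <reflection> come from the comment above.
# <output> is an assumed tag name for the final answer.
TAGS = ("think", "reflection", "output")

def split_tagged_response(text):
    """Map each tag name to the list of its contents, in order of appearance."""
    sections = {}
    for tag in TAGS:
        pattern = rf"<{tag}>(.*?)</{tag}>"
        # DOTALL lets a tagged section span multiple lines.
        matches = re.findall(pattern, text, flags=re.DOTALL)
        sections[tag] = [m.strip() for m in matches]
    return sections

# Made-up sample response, not real model output.
sample = (
    "<think>17 * 3 = 41</think>"
    "<reflection>Check: 17 * 3 is 51, not 41.</reflection>"
    "<output>51</output>"
)
parsed = split_tagged_response(sample)
```

Regex matching like this only handles well-formed, non-nested tags, which is probably enough for eyeballing a handful of samples.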
Note that there's a threshold for how smart the model has to be to take advantage of this flow (https://x.com/mattshumer_/status/1831775436420083753) - 8B is too dumb.
In which case, what happens if you apply this to a GPT-4o finetune, or to Claude 3.5 Sonnet?
What happens if you combine it with variants of tree-based reasoning? With AlphaGeometry (https://www.nature.com/articles/s41586-023-06747-5#Sec3)? With MCTSr (https://arxiv.org/abs/2406.07394)?
I was just thinking: since GPT-4o and Sonnet are closed models, do we know that this method was not already used to train them? Maybe Reflection has simply found a path to bigger improvements than they did. Llama 3.1 apparently didn't improve as much. It's just a thought though.
If they had used it, this thing wouldn't be trading punches with them at its size.
Sonnet does something like this. See - https://tyingshoelaces.com/blog/forensic-analysis-sonnet-pro...
What are the parameter counts of GPT-4o and Sonnet?