Comment by juxtaposicion
6 months ago
Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.
Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.
The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.
Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.
https://huggingface.co/mattshumer/Reflection-70B says system prompt used is:
Also, only "smarter" models can use this flow, according to https://x.com/mattshumer_/status/1831775436420083753
> Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.
They may already implement this technique, we can't know.
Claude 3.5 does have some "thinking" ability - I've seen it pause and even say it was thinking before. Presumably this is just some output it decides not to show you.
THIS!!!!!!! People act like Claude and 4o are base models with no funny business behind the scenes, we don't know just how much additional prompt steps are going on for each queue, all we know is what the API or Chat interface dump out, what is happening behind that is anyones guess.. The thinking step and refinement steps likely do exist on all the major commercial models. It's such a big gain for a minimal expenditure of backend tokens, WTF wouldn't they be doing it to improve the outputs?
3 replies →
That's only in the web version, it's just that they prompt it to do some CoT in the antThinking XML tag, and hide the output from inside that tag in the UI.
3 replies →
I suspect GPT4o already has training for CoT. I've noticed it often responds by saying something like "let's break it down step by step". Or maybe it's the system prompt.
I am not sure, but you seem to be implying that the Reflection model is running through multiple rounds? If so, that is not what is happening here. The token generation is still linear next token prediction. It does not require multiple rounds to generate the chain of thought response. It does that in one query pass.
I have been testing the model for the last few hours and it does seem to be an improvement on LLAMA 3.1 upon which it is based. I have not tried to compare it to Claude or GPT4o because I don't expect a 70b model to outperform models of that class no matter how good it is. I would happy to be wrong though...
I had a similar idea[0], interesting to see that it actually works. The faster LLM workloads can be accelerated, the more ‘thinking’ the LLM can do before it emits a final answer.
[0]: https://news.ycombinator.com/item?id=41377042
Further than that, it feels like we could use constrained generation of outputs [0] to force the model to do X amount of output inside of a <thinking> BEFORE writing an <answer> tag. It might not always produce good results, but I'm curious what sort of effect it might have to convince models that they really should stop and think first.
[0]: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
Can we replicate this in other models without finetuning them ?
See: https://news.ycombinator.com/item?id=41460812
Apple infamously adds "DO NOT HALLUCINATE" to its prompts.
Huh ? Source please (this is fascinating)
1 reply →
what's our estimate of the cost to finetune this?
I don't know the cost, but they supposedly did all their work in 3 weeks based on something they said in this video: https://www.youtube.com/watch?v=5_m-kN64Exc