Comment by tw1984

5 months ago

is there any public info on why such a DeepSeek R1 + claude-3-5 combo worked better than using a single model?

Sonnet 3.5 is the best non-Chain-of-Thought code-authoring model. When paired with R1's CoT output, Sonnet 3.5 performs even better, outperforming vanilla R1 (and everything else), which suggests Sonnet is better than R1 at making use of R1's own CoT.

It's a scenario where the result is greater than the sum of its parts.
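
For what it's worth, the pairing is mechanically simple: ask R1 for a plan, then hand that plan to Sonnet, which only has to write the edits. Here is a minimal sketch, assuming DeepSeek's OpenAI-compatible API and Anthropic's Python SDK; the model IDs, prompts, and keys are illustrative, not the exact setup behind those scores.

    # Rough sketch of the R1-plans / Sonnet-codes pairing discussed above.
    # Model IDs, endpoints, keys, and prompts are illustrative assumptions,
    # not the exact configuration used for the leaderboard scores.
    from openai import OpenAI   # DeepSeek serves an OpenAI-compatible API
    import anthropic

    TASK = "Refactor parse_config() to accept a pathlib.Path as well as a str."

    # Stage 1: ask R1 for a plan. deepseek-reasoner returns its chain of thought
    # in a separate reasoning_content field; fall back to the visible answer.
    r1 = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_KEY")
    plan_resp = r1.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user",
                   "content": "Plan the code changes for this task, step by step:\n" + TASK}],
    )
    msg = plan_resp.choices[0].message
    plan = (getattr(msg, "reasoning_content", None) or "") + "\n" + (msg.content or "")

    # Stage 2: hand the plan to Sonnet, which only has to turn it into edits.
    sonnet = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
    edit_resp = sonnet.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": ("Task:\n" + TASK
                               + "\n\nA reasoning model proposed this plan:\n" + plan
                               + "\n\nApply the plan and output the edits as unified diffs only.")}],
    )
    print(edit_resp.content[0].text)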

From my experiments with the DeepSeek Qwen-32B distill model, it did not follow the edit instructions - the output format was wrong. I know the distill models are not at all the same as the full model, but that could provide a clue. Combine that information with the scores and you have a reasonable hypothesis.

  • > I know the distill models are not at all the same as the full model

    It's far worse than that. It's not the DeepSeek model at all - it's Qwen enhanced with DeepSeek, so it's still Qwen.

My personal experience is that R1 is smarter than 3.5 Sonnet, but 3.5 Sonnet is a better coder. Thus it may be better to let R1 tackle the problem, but let 3.5 Sonnet implement the solution.

  • Specialization of AI models is cool. Just like some people are better planners and others are better at raw coding.