I mean, if this works, it usually means you're not getting the best out of either LLM to begin with.
If they actually inspected where the two models' performance diverges, looking at each model on its own, they'd probably find specific classes of mistakes each one makes that a better prompt/CoT/workflow for that individual model would fix.
For a given prompt, different model families almost always have idiosyncratic gaps that need fixing, because their post-training for instruction following differs.
That's also why LLM routers feel kind of silly: the right prompt for one model on a complex task is almost never the optimal prompt for the next model.
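For what it's worth, here's a rough sketch of the per-model inspection I mean, assuming you already have eval results hand-tagged with a failure category. The model names, category labels, and rows are all made-up placeholders, not real measurements.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled eval rows:
# (model, task_id, failure_category, or None if the task passed).
# Every value here is a placeholder for illustration only.
results = [
    ("model_a", 1, None),
    ("model_a", 2, "ignored_output_format"),
    ("model_a", 3, "ignored_output_format"),
    ("model_b", 1, "skipped_reasoning_step"),
    ("model_b", 2, None),
    ("model_b", 3, "skipped_reasoning_step"),
]


def failure_profile(rows):
    """Count failures per category, separately for each model."""
    profile = defaultdict(Counter)
    for model, _task_id, category in rows:
        if category is not None:
            profile[model][category] += 1
    return profile


profile = failure_profile(results)
for model, counts in sorted(profile.items()):
    # The dominant category per model is what a model-specific prompt should target.
    print(model, counts.most_common())
```

If each model's failures cluster in a different category, that points at a prompt fix per model rather than at a second model to paper over the gap.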
I always do this with o3, Gemini 2.5, and Opus 4 when brainstorming hard problems: I copy each model’s response to the other two.
Iterate until they pat each other on the back :)
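The manual copy-paste loop could be sketched roughly like this. Each "model" is just a hypothetical callable that takes a prompt and returns text (no real API is assumed), and exact string equality is a crude stand-in for "they agree", which in practice is a human judgment call.

```python
from typing import Callable, Dict

# Hypothetical interface: a model is any function from prompt text to response text.
Model = Callable[[str], str]


def cross_review(models: Dict[str, Model], question: str, max_rounds: int = 3) -> Dict[str, str]:
    """Show every model the other models' answers and let it revise, for a few rounds."""
    # Round 0: each model answers the question independently.
    answers = {name: ask(question) for name, ask in models.items()}

    for _ in range(max_rounds):
        # Crude agreement check: stop once all answers are identical strings.
        if len(set(answers.values())) == 1:
            break

        revised = {}
        for name, ask in models.items():
            others = "\n\n".join(
                f"[{other}] {text}" for other, text in answers.items() if other != name
            )
            prompt = (
                f"Question: {question}\n\n"
                f"Your previous answer:\n{answers[name]}\n\n"
                f"Other models' answers:\n{others}\n\n"
                "Revise your answer if the others raise valid points."
            )
            revised[name] = ask(prompt)
        answers = revised

    return answers
```

The `max_rounds` cap is there because nothing guarantees the answers ever converge.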