Comment by tw1984

3 months ago

really love your dual standard mate!

According to the SWE-bench results I am looking at, Kimi K2 has a higher agentic coding score than Gemini, and its gap with Claude Haiku 4.5 is just 71.3% vs 73.3%. That 2-point difference is actually smaller than the 3-point gap between GPT 5.1 (76.3%) and Claude Haiku 4.5. Interestingly, Gemini and Claude Haiku 4.5 are "frontier" according to you, but Kimi K2, which actually has the highest HLE and LiveCodeBench results, is just "near" the frontier.
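As a quick sanity check on that arithmetic, here is a minimal sketch. The scores are only the figures quoted in this comment, not independently verified, and the `scores` dict and `gap` helper are names made up for illustration.

```python
# SWE-bench scores as quoted in the comment above (percent).
scores = {
    "Kimi K2": 71.3,
    "Claude Haiku 4.5": 73.3,
    "GPT 5.1": 76.3,
}


def gap(higher: str, lower: str) -> float:
    """Return the difference in percentage points between two quoted scores."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(scores[higher] - scores[lower], 1)


haiku_vs_kimi = gap("Claude Haiku 4.5", "Kimi K2")
gpt_vs_haiku = gap("GPT 5.1", "Claude Haiku 4.5")

print(f"Claude Haiku 4.5 vs Kimi K2: {haiku_vs_kimi} points")
print(f"GPT 5.1 vs Claude Haiku 4.5: {gpt_vs_haiku} points")
```

On these quoted numbers, the Haiku-to-Kimi gap comes out smaller than the GPT-to-Haiku gap.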


sholain 3 months ago

You started by saying 'There's no way to judge!' - but then you bring out 'Benchmarks!' ... and hypocritically imply that I have 'dual standards'?

The snark and ad hominem really undermine your case.

I won't descend to the level of calling other people names, or their arguments 'A Joke', or use 'It's Common Sense!' as a rhetorical device ...

But I will say that it's unreasonable to imply that Kimi, Qwen etc are 'Frontier Models'.

They are pretty good, and narrowly achieve some good scores on some benchmarks - but they're not broadly consistent at that Tier 1 quality.

They don't have the extended fine-tuning that makes the Tier 1 models better for many applications, especially coding, nor do they have the extended, non-LLM architecture components that further elevate those models' usefulness.

Nobody would choose Qwen for coding if they could have Sonnet at the same price and terms.

We use Qwen sometimes because it's 'cheap and good' not because it's 'great'.

The 'true coding benchmark' is that developers would choose Sonnet over Qwen 99 times out of 100, which is the difference between 'Tier 1' and 'Not Really Tier 1'.

Finally, I run benchmarks with my team and I see in a pretty granular way what's going on.

What I've said above lines up with the reality of our benchmarks.

We're looking at deploying with GLM/Z.ai - but not because it's the best model.

Google, OAI and Anthropic score consistently better - the issue is 'cost' and the fact that we can overcome the limitations of GLM. So 'it's good enough'.

That 'real world business case' best characterizes the overall situation.