Comment by tw1984

3 days ago

> Nobody has access to 'frontier quality models' except Open AI, Anthropic, Google, maybe Grok, maybe Meta etc. aka nobody in China quite yet.

Welcome to 2025. Meta doesn't have anything on par with what the Chinese labs have, that is common knowledge. Kimi, GLM, Qwen and MiniMax are all frontier models no matter how you judge it. DeepSeek is obviously cooking something big, you need to be totally blind to ignore that.

America's lead in LLMs is a matter of weeks, not quarters or years. Arguing that Chinese spy agencies have to rely on American coding agents to do their job is more like a joke.

Kimi is plausibly near the frontier but definitely not up to GPT-5 spec; the rest are definitely not 'frontier models'.

There are objective ways of 'judging' them.

  • really love your double standard, mate!

    According to the SWE-bench results I am looking at, Kimi K2 has a higher agentic coding score than Gemini, and its gap with Claude Haiku 4.5 is just 71.3% vs 73.3%; that 2-point difference is actually smaller than the 3-point gap between GPT-5.1 (76.3%) and Claude Haiku 4.5. Interestingly, Gemini and Claude Haiku 4.5 are "frontier" according to you, but Kimi K2, which actually has the highest HLE and LiveCodeBench results, is just "near" the frontier.

    • You started by saying 'There's no way to judge!' - but then bring out 'Benchmarks!' ... and hypocritically claim that I have a 'double standard'?

      The snark and ad hominem really undermine your case.

      I won't descend to the level of calling other people names, calling their arguments 'A Joke', or using 'It's Common Knowledge!' as a rhetorical device ...

      But I will say that it's unreasonable to imply that Kimi, Qwen, etc. are 'Frontier Models'.

      They are pretty good, and narrowly achieve some good scores on some benchmarks - but they don't hit that Tier 1 quality broadly and consistently.

      They don't have the extensive fine-tuning that makes the Tier 1 models better for many applications, especially coding, nor the additional non-LLM architecture components that further elevate their usefulness.

      Nobody would choose Qwen for coding if they could have Sonnet at the same price and terms.

      We use Qwen sometimes because it's 'cheap and good', not because it's 'great'.

      The 'true coding benchmark' is that developers would choose Sonnet over Qwen 99 times out of 100, which is the difference between 'Tier 1' and 'Not Really Tier 1'.

      Finally, I run benchmarks with my team, so I see what's going on at a pretty granular level.

      What I've said above lines up with the reality of our benchmarks.

      We're looking at deploying with GLM/Z.ai - but not because it's the best model.

      Google, OAI and Anthropic score consistently better - the issue is cost, plus the fact that we can work around GLM's limitations. So it's 'good enough'.

      That 'real-world business case' best characterizes the overall situation.