Comment by mmaunder

20 hours ago

I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
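The routing-plus-budget idea above can be sketched in a few lines. This is purely illustrative: the model IDs and token thresholds are placeholders, not real endpoints or prices, and a real router would classify difficulty rather than take a flag.

```python
# Illustrative model router: easy tasks go to a cheap model with a tight
# output budget, hard tasks to a stronger model with a larger one.
# Model names and budgets are hypothetical.

CHEAP_MODEL = "minimax-m2"    # placeholder id for a cheap model
STRONG_MODEL = "claude-opus"  # placeholder id for a strong model

def route(task: str, hard: bool, max_output: int = 1024) -> dict:
    """Build a request dict, capping output tokens to control cost."""
    model = STRONG_MODEL if hard else CHEAP_MODEL
    # Cheaper models tend to burn more reasoning tokens, so cap tighter there.
    budget = max_output if hard else min(max_output, 512)
    return {
        "model": model,
        "max_tokens": budget,
        "messages": [{"role": "user", "content": task}],
    }
```

For example, `route("rename this variable", hard=False)` would target the cheap model with a 512-token cap, while `route(spec, hard=True, max_output=4096)` spends the budget where it pays off.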

I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.

It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.

I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.

If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.

I won’t use anything less than the SOTA. I tried using Opus 4.6 at medium reasoning and immediately regretted it. Even High messes up often enough.

Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...

  • Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro.

    I'm not saying it's bad, but it's definitely different than the others.

  • Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.

    Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)

  • It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.

    • While I like these models, if you're getting similar results to SOTA models from 6 months ago, I have to question how far you pushed those models 6 months ago. It is really easy to find scenarios where these models really underperform. They take far more advanced harnesses to perform reasonably (and hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.

      I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.

    • I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling.

  • Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.

    • Oh, I hadn't thought about this; that's a good idea. I also feel that model performance generally changes over time (usually for the worse).

      The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really expensive.


> I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence.

I use MiniMax daily, mostly for coding tasks, usually via pi-coding-agent.

> The downsides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.

I don't care about token use; I pay per request on my cheap coding plan. I didn't notice slower outputs; it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.

> Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

Exactly. For my use case, I get 1,500 API requests every 5 hours for 10€ a month. I never hit the limit, even during intensive coding sessions.

What I notice is that while Opus and Sonnet look better on synthetic benchmarks, it doesn't matter in the real world. I never put as much effort into a problem spec as the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me, yet that's exactly what all those benchmarks test, and that's where Anthropic's tools shine in comparison to cheaper Chinese models.

When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.

Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session: it gets stuck eventually, and a model switch is sometimes required to get a fresh perspective on the problem.

The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.

  • I've only been using free tokens for a year now: Gemini, until they just dropped Pro, so I switched to MiniMax. It was a bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.

    If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.

    I've never used Claude, and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.

    When I hit issues with these lower models, I think hard about creating the right tooling, agnostic to the harness. Maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?

    • I've used Gemini and now Claude. Both were meh until I found the superpowers skill. Will be trying ChatGPT next month.

      You can "feel" the LLM being limited with Gemini, less so with Claude. Hopefully even less so with ChatGPT.

Kimi's been one of my go-to options lately, and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.

It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
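One cheap guard against the verbatim drift described above is to diff the model's supposedly verbatim output against the original and reject it if anything changed. A minimal sketch using Python's standard-library `difflib` (the snippet inputs are made up for illustration):

```python
import difflib

def unintended_changes(original: str, rewritten: str) -> list[str]:
    """Return unified-diff lines for any drift between the original
    text and a rewrite that was supposed to be verbatim."""
    return list(difflib.unified_diff(
        original.splitlines(), rewritten.splitlines(),
        fromfile="original", tofile="rewritten", lineterm=""))

src = "def add(a, b):\n    return a + b"
out = "def add(a, b):\n    return a+b"  # model silently dropped the spaces
drift = unintended_changes(src, out)
# A non-empty diff means the rewrite wasn't verbatim: reject it and
# fall back to asking the model for a diff instead of a full rewrite.
```

For a trusted-diffs-only workflow, the same check can gate which hunks get applied at all.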

  • And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.

    • No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.

      I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.

MiniMax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.

They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

  • > They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

    This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.

    • 10x more code output is 10x more review.

      We've gone from doing the first 90% and then the second 90% to the first 90% and then the second 990%; it's exhausting.

Yea, they are still useful, but not close to Claude or GPT. They work well for simple changes, though. I use a combo of MiniMax and Codex.