Comment by XCSme
18 hours ago
Yup, they do quite poorly on random non-coding tasks:
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro.
I'm not saying it's bad, but it's definitely different from the others.
The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point, don't publish a benchmark; that also explains why their results are useless.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
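For what it's worth, this failure mode is easy to grade around: strip markdown emphasis from the model's answer before comparing it to the expected format. A hypothetical sketch (not the benchmark's actual code):

```python
import re

def normalize(answer: str) -> str:
    """Strip markdown emphasis characters and surrounding whitespace
    so "*1*, *1*" and "1, 1" grade identically."""
    return re.sub(r"[*_`]", "", answer).strip()

print(normalize("*1*, *1*"))  # → "1, 1"
print(normalize("1, 1"))      # → "1, 1"
```

A few lines like this would separate "wrong answer" from "right answer, wrong decoration", which is the distinction being argued over here.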
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
It's also worth comparing Qwen 3.5; it's a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not far behind current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 Flash. They're comparable to the best American models from 6 months ago.
While I like these models, if you're getting results similar to SOTA models from 6 months ago, I have to question how hard you pushed those models 6 months ago. It is really easy to find scenarios where these models underperform. They take far more advanced harnesses to perform reasonably (hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling.
We used Kimi 2.5; it's really good.
I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.
GLM 5 here is significantly better than GPT-5.4
Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
Oh, I hadn't thought about this; that's a good idea. I also feel that model performance generally changes over time (usually it gets worse).
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really expensive.
Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.
I was thinking that tokens spent in such a case could also be an interesting measure, though some agents will do small useful refactorings along the way. The prompt could specify making the minimal change required to achieve the goal.