
Comment by papersail

8 hours ago

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
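For anyone who wants to reproduce the filtering, here is a rough sketch of the two rules described above (keep the top two entries per model, stop at a cutoff entry). It is not the exact script used for the table: the `curate` and `family_of` names are made up for illustration, treating everything before the first " (" as the model name is an assumption about how variants are grouped, and the sample rows are placeholders rather than real scores.

```python
from collections import defaultdict


def family_of(name):
    # Assumption: variants of one model differ only in a trailing
    # parenthetical, e.g. "Alpha (max)" and "Alpha (high)" -> "Alpha".
    return name.split(" (")[0]


def curate(entries, cutoff_name, per_family=2):
    """Keep the `per_family` best-scoring entries per model family and
    drop everything that scores below the entry named `cutoff_name`."""
    cutoff_score = next(score for name, score in entries if name == cutoff_name)
    # Stable sort, highest score first; ties keep their input order.
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    kept = defaultdict(int)
    result = []
    for name, score in ranked:
        if score < cutoff_score:
            break  # everything after this point is below the cutoff
        family = family_of(name)
        if kept[family] < per_family:
            kept[family] += 1
            result.append((name, score))
    return result


# Placeholder rows, not real leaderboard data.
sample = [
    ("Alpha (max)", 60.0),
    ("Alpha (high)", 58.0),
    ("Alpha (low)", 55.0),
    ("Beta", 50.0),
    ("Gamma (fast)", 40.0),
    ("Delta", 35.0),
]

for rank, (name, score) in enumerate(curate(sample, "Gamma (fast)"), start=1):
    print(rank, score, name)
```

With the placeholder rows, the third "Alpha" variant is dropped by the two-per-model cap and "Delta" falls below the cutoff.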

My observations:

I'm surprised to see MiniMax-M3 so low on that list. That doesn't match my experience: I found it smarter than Gemini for a lot of things, that's for sure.

I'm also surprised to see Gemini 3.1 ranked that high. IMHO it remains blatantly incompetent at tool use, even in Google's own harnesses, so I can only assume this benchmark doesn't weight agentic workflow tasks very heavily. Gemini can write code just fine. It just can't work well as an agent.

In my experience, GLM-5.2 and Qwen3.7 Max were fairly expensive per token, which makes them hard to argue for when the SOTA coding plans come at a fixed price that is potentially more cost-effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)

DeepSeek is clearly the best value for money, given the right harness and prompting.