
Comment by kristopolous

12 hours ago

I have a script that ranks these based on the Coding Index from Artificial Analysis.

All it does is pull a JSON from their main table page and parse out the fields I care about (coding).
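
A minimal sketch of that pull-and-rank step, written in Python rather than the shell the actual script uses. The payload shape and the key names (`name`, `coding_index`, `age`, `size`) are assumptions for illustration, not Artificial Analysis's real schema, and the inline sample stands in for the downloaded JSON. The "Unscored placeholder" row is hypothetical; the other values are copied from the output in this comment.

```python
import json

# Stand-in for the downloaded JSON. Key names are hypothetical; the real
# payload from Artificial Analysis may be shaped differently.
SAMPLE_JSON = """
[
  {"name": "GPT-5.5 (xhigh)", "coding_index": 59.1, "age": 55, "size": null},
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age": 1, "size": "large"},
  {"name": "Kimi K2.6", "coding_index": 47.1, "age": 58, "size": "large"},
  {"name": "Unscored placeholder", "coding_index": null, "age": 10, "size": null}
]
"""


def rank_by_coding(payload: str) -> list[dict]:
    """Parse the payload, drop unscored models, sort ascending by score."""
    models = json.loads(payload)
    scored = [m for m in models if m.get("coding_index") is not None]
    # Ascending order puts the best model last, i.e. next to the shell prompt.
    return sorted(scored, key=lambda m: m["coding_index"])


def format_rows(rows: list[dict]) -> str:
    """Render the ranked rows as a fixed-width text table."""
    lines = [f"{'score':<5}  {'age':<3}  {'size':<5}  name"]
    for m in rows:
        size = m["size"] or "-"
        lines.append(
            f"{m['coding_index']:<5.1f}  {m['age']:<3}  {size:<5}  {m['name']}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_rows(rank_by_coding(SAMPLE_JSON)))
```

Sorting ascending is a deliberate choice here: on a long list, the top-scoring rows end up at the bottom of the terminal where the cursor is.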

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size   name
  47.1   58   large  Kimi K2.6
  47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70   -      Muse Spark
  47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.7   188  -      GPT-5.2 (xhigh)
  50.1   29   -      Qwen3.7 Max
  50.7   1    large  GLM-5.2 (max)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92   -      GPT-5.4 mini (xhigh)
  52.1   55   -      GPT-5.5 (low)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118  -      Gemini 3.1 Pro Preview
  56.2   55   -      GPT-5.5 (medium)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104  -      GPT-5.4 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  59.1   55   -      GPT-5.5 (xhigh)
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
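
One possible way to put a number on that lag, sketched in Python. It takes the best-scoring open model and finds the oldest closed model that already scored at least as high; the difference in the age column is the lag. Two assumptions that the comment does not state: that a `size` of "large" marks an open-weights model, and that age is counted in days. The rows are a subset copied from the table above.

```python
# (score, age, open_weights, name). open_weights is inferred from the size
# column ("large" -> True), which is an assumption, not something stated above.
ROWS = [
    (47.1, 58, True, "Kimi K2.6"),
    (47.5, 54, True, "DeepSeek V4 Pro (Reasoning, Max Effort)"),
    (48.7, 188, False, "GPT-5.2 (xhigh)"),
    (50.7, 1, True, "GLM-5.2 (max)"),
    (50.9, 120, False, "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)"),
    (53.1, 132, False, "GPT-5.3 Codex (xhigh)"),
    (55.5, 118, False, "Gemini 3.1 Pro Preview"),
    (62.0, 8, False,
     "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)"),
]


def open_model_lag(rows):
    """Return (lag, best_open_name, matched_closed_name).

    lag is how much older the oldest closed model with an equal-or-better
    score is, compared with the best open model.
    """
    best_open = max((r for r in rows if r[2]), key=lambda r: r[0])
    # Closed models that had already reached the best open model's score.
    matched = [r for r in rows if not r[2] and r[0] >= best_open[0]]
    oldest = max(matched, key=lambda r: r[1])
    return oldest[1] - best_open[1], best_open[3], oldest[3]


if __name__ == "__main__":
    lag, open_name, closed_name = open_model_lag(ROWS)
    print(f"{open_name} trails {closed_name} by {lag} (age units)")
```

On these rows it pairs GLM-5.2 (max) with GPT-5.3 Codex (xhigh) for a gap of 131 in the age column, a little over four months if age is in days. Other definitions (for example, matching against the median open model instead of the best one) would give different numbers.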

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.

  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)

  • Lol thank you for sorting.

    Are the scores here normalized such that each point difference is equidistant?

  •   rank  score  age  size   name
      1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
      2     59.1   55   -      GPT-5.5 (xhigh)
      3     58.5   55   -      GPT-5.5 (high)
      4     57.2   104  -      GPT-5.4 (xhigh)
      5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
      6     55.5   118  -      Gemini 3.1 Pro Preview
      7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
      8     53.1   132  -      GPT-5.3 Codex (xhigh)
      9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
      10    51.5   92   -      GPT-5.4 mini (xhigh)
      11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
      12    50.7   1    large  GLM-5.2 (max)
      13    50.1   29   -      Qwen3.7 Max
      14    48.7   188  -      GPT-5.2 (xhigh)
      15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
      16    47.8   205  -      Claude Opus 4.5 (Reasoning)
      17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
      18    47.5   70   -      Muse Spark
      19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
      20    47.1   58   large  Kimi K2.6
      21    47.1   29   -      Gemini 3.5 Flash (minimal)
      22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
      23    46.5   211  -      Gemini 3 Pro Preview (high)
      24    46.5   16   -      Qwen3.7 Plus
      25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
      26    45.6   5    large  Kimi K2.7 Code
      27    45.6   104  -      GPT-5.4 (low)
      28    45.5   56   large  MiMo-V2.5-Pro
      29    45.1   43   -      GPT-5.5 Instant (May 2026)
      30    45.0   29   -      Gemini 3.5 Flash (high)
      31    44.9   58   -      Qwen3.6 Max Preview
      32    44.7   216  -      GPT-5.1 (high)
      33    44.2   188  -      GPT-5.2 (medium)
      34    44.2   126  large  GLM-5 (Reasoning)
      35    43.9   92   -      GPT-5.4 nano (xhigh)
      36    43.4   71   large  GLM-5.1 (Reasoning)
      37    43.4   16   large  MiniMax-M3
      38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
      39    43.0   188  -      GPT-5.2 Codex (xhigh)
      40    42.9   76   -      Qwen3.6 Plus
      41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
      42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
      43    42.2   99   -      Grok 4.20 0309 (Reasoning)
      44    42.1   56   large  MiMo-V2.5
      45    41.9   91   large  MiniMax-M2.7
      46    41.4   91   -      MiMo-V2-Pro
      47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
      48    41.0   48   -      Grok 4.3 (high)
      49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
      50    40.5   342  -      Grok 4
      51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)
    
    

    A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
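
The curation rule described above (two best entries per model, nothing below the cutoff model) can be sketched roughly as follows. This is an illustration in Python, not the code that produced the list: grouping entries by the text before the first " (" is an assumption about what counts as the same model, and the "Hypothetical small model" row is made up to show the cutoff in action. The other scores are copied from the tables in this thread.

```python
from collections import defaultdict

ENTRIES = [
    (59.1, "GPT-5.5 (xhigh)"),
    (58.5, "GPT-5.5 (high)"),
    (56.7, "Claude Opus 4.8 (Adaptive Reasoning, Max Effort)"),
    (56.2, "GPT-5.5 (medium)"),
    (52.1, "GPT-5.5 (low)"),
    (39.8, "DeepSeek V4 Flash (Reasoning, High Effort)"),
    (30.0, "Hypothetical small model"),  # invented row, below the cutoff
]


def base_model(name: str) -> str:
    """Strip the trailing '(variant)' part: 'GPT-5.5 (xhigh)' -> 'GPT-5.5'."""
    return name.split(" (", 1)[0].strip()


def curate(entries, per_model=2, cutoff_model="DeepSeek V4 Flash"):
    """Keep the top `per_model` entries per model, down to the cutoff model."""
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    # The cutoff score is the best score achieved by the cutoff model.
    cutoff = max(
        (score for score, name in ranked if base_model(name) == cutoff_model),
        default=float("-inf"),
    )
    kept, seen = [], defaultdict(int)
    for score, name in ranked:
        if score < cutoff:
            break  # everything after this point scores lower still
        key = base_model(name)
        if seen[key] < per_model:
            seen[key] += 1
            kept.append((score, name))
    return kept


if __name__ == "__main__":
    for rank, (score, name) in enumerate(curate(ENTRIES), start=1):
        print(f"{rank:<4}  {score:<5.1f}  {name}")
```

With these inputs, GPT-5.5 (medium) and GPT-5.5 (low) are dropped because two GPT-5.5 entries already rank above them, and the invented row is dropped for falling under the DeepSeek V4 Flash score.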

    • My observations:

      Surprised to see MiniMax M3 so low on that list. That's not really my experience; I found it smarter than Gemini for a lot of things, that's for sure.

      Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.

      GLM 5.2 and Qwen3.7 Max were, in my experience, fairly expensive to use on a per-token price and hard to argue in favour of when the SOTA coding plans have a fixed price that makes them potentially more cost effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)

      DeepSeek is clearly the best value for $$, with the right harness and prompting.

  • Short comments...

    - GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here by the Anthropic Marketeer strike force...

    - China is going to eat the US lunch on AI

    - What have European universities and companies been doing? It's as if, in some parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.

    - If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?

    • None of these models come from universities, European or otherwise.

      Mistral is clearly not competing at the frontier right now. Whether this is due to a lack of VC funds, a lack of technical ability, or the former arising from the latter would be interesting to know.

      The top models are from startups. Among the FAANG, only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there, though.

      So why did no EU-based startups succeed while two US startups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.


    • To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.

      ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.

      It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.


    • They did Muse Spark ... it's not garbage.

      Also, what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits Facebook's needs perfectly...


    • Well, Europe is famously a laggard when it comes to new tech: in parts of Switzerland, two horses were required to be hitched in front of cars to pull them up until 1925. The UK required a person to walk in front of a car and wave a red flag.

    • "…Anthropic Marketeer strike force…"

      Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.


    • I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing so much as three other things:

      1. SamA and his company have a well-deserved bad reputation and Anthropic got some early good PR for basically not being SamA.

      2. Claude Code got early head space, Boris and crew basically "invented" this kind of agent, and so has first mover advantage despite its known reliability and cost issues.

      3. Most people I talk to haven't even tried Codex for some reason

      Also it's uncool to complain about downvotes.

    • I downvoted you for your complaining about downvotes fwiw.

      And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.

      As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.

Note that AA's coding index is only made up of two benchmarks: Terminal-Bench Hard and SciCode. I'm skeptical that it makes a good coding index. It ranks Gemma 4 31B above DeepSeek V4 Flash. Having used both of those models for a broad variety of coding tasks, I would choose DeepSeek every day.

Consider using descending score order (best on top)

Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.

Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.

It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.

Thanks for sharing. I'm curious: why didn't you sort with the score descending?

  • Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?

  • Not OP but if you run this from the CLI it does make the ordering make a little more sense

  • Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.

    Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.

    Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.