Comment by kristopolous

12 hours ago

I have a script that ranks these based on the coding index from Artificial Analysis.

All it does is pull the JSON from their main table page and parse out the fields I care about (coding).
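For anyone curious what that amounts to, here is a minimal Python sketch of the same idea, not the actual script (which is a shell script). The field names `coding_index` and `age_days` and the embedded sample payload are assumptions made up for illustration; the real JSON from Artificial Analysis may be shaped differently.

```python
import json

# Hypothetical payload: the real field names on Artificial Analysis may differ,
# and the last entry is an invented placeholder for a model with no score yet.
SAMPLE = """
[
  {"name": "GPT-5.5 (xhigh)", "coding_index": 59.1, "age_days": 55},
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age_days": 1},
  {"name": "Kimi K2.6", "coding_index": 47.1, "age_days": 58},
  {"name": "placeholder without a coding score", "coding_index": null, "age_days": 3}
]
"""

def rank_by_coding(raw: str) -> list[tuple[float, int, str]]:
    """Keep only the fields of interest and sort ascending by score."""
    rows = [
        (m["coding_index"], m["age_days"], m["name"])
        for m in json.loads(raw)
        if m.get("coding_index") is not None  # skip models with no score
    ]
    return sorted(rows)  # best score ends up on the last line

ranked = rank_by_coding(SAMPLE)
for score, age, name in ranked:
    print(f"{score:6.1f} {age:4d}  {name}")
```

Sorting ascending puts the best model on the last line, which is where the eye lands after running it in a terminal.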

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size   name
  47.1   58   large  Kimi K2.6
  47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70   -      Muse Spark
  47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.7   188  -      GPT-5.2 (xhigh)
  50.1   29   -      Qwen3.7 Max
  50.7   1    large  GLM-5.2 (max)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92   -      GPT-5.4 mini (xhigh)
  52.1   55   -      GPT-5.5 (low)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118  -      Gemini 3.1 Pro Preview
  56.2   55   -      GPT-5.5 (medium)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104  -      GPT-5.4 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  59.1   55   -      GPT-5.5 (xhigh)
  62     8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email

Some key takeaways:

* Open models are on about a 4-7 month lag right now, depending on how you want to measure it.

* If this keeps up, you might see an open-weights model doing Claude Fable 5-level work before the new year.
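One possible way to put a number on that lag, sketched in Python. This is an assumed definition, not necessarily the one behind the 4-7 month figure: for each open model, find the oldest closed model that already scored at least as high, and subtract the ages. It also assumes that `age` is days since release and that a `large` size marks an open-weights model; the rows are a subset copied from the output above.

```python
# (score, age_in_days, is_open, name) rows copied from the partial output above.
# Assumption: a "large" size marks an open-weights model and "-" a closed one.
ROWS = [
    (47.1, 58, True, "Kimi K2.6"),
    (47.5, 54, True, "DeepSeek V4 Pro (Reasoning, Max Effort)"),
    (47.8, 205, False, "Claude Opus 4.5 (Reasoning)"),
    (48.7, 188, False, "GPT-5.2 (xhigh)"),
    (50.7, 1, True, "GLM-5.2 (max)"),
    (50.9, 120, False, "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)"),
    (53.1, 132, False, "GPT-5.3 Codex (xhigh)"),
]

def lag_days(open_row, rows):
    """Days between an open model's release and the oldest closed model
    that already scored at least as high (None if there is no such model)."""
    score, age, _, _ = open_row
    older = [r[1] for r in rows if not r[2] and r[0] >= score]
    return max(older) - age if older else None

for row in ROWS:
    if row[2]:
        days = lag_days(row, ROWS)
        if days is not None:
            print(f"{row[3]}: {days} days (~{days / 30.4:.1f} months)")
```

On this subset the three open models come out between four and five months behind. A different definition (say, comparing only against each lab's flagship) would give a different number, which is the "depending on how you want to measure it" part.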

If people sign up for the free mailing list (which does just this), I'll turn it back on. It sends an email when new model evals drop; it was pretty useful.

65 comments

papersail 11 hours ago

  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)

christoff12 8 hours ago

Lol thank you for sorting.
Are the scores here normalized such that each point difference is equidistant?

papersail 8 hours ago

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
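The curation rule described above can be sketched in Python as follows. This is an illustration, not the code that produced the list; treating everything before a trailing parenthesis as the "model" and the helper names `base_model` and `curate` are assumptions.

```python
import re
from collections import defaultdict

def base_model(name: str) -> str:
    """Strip a trailing parenthesised variant: "GPT-5.5 (xhigh)" -> "GPT-5.5"."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", name)

def curate(rows, keep=2, cutoff_name="DeepSeek V4 Flash"):
    """rows: (score, name) pairs. Keep the top `keep` variants per base model,
    then drop everything scoring below the best entry of the cutoff model."""
    kept = defaultdict(int)
    out = []
    for score, name in sorted(rows, reverse=True):  # best score first
        key = base_model(name)
        if kept[key] < keep:
            kept[key] += 1
            out.append((score, name))
    floor = max((s for s, n in out if base_model(n) == cutoff_name), default=None)
    return [r for r in out if floor is None or r[0] >= floor]

# Rows taken from the lists in this thread.
rows = [
    (59.1, "GPT-5.5 (xhigh)"), (58.5, "GPT-5.5 (high)"),
    (56.2, "GPT-5.5 (medium)"), (52.1, "GPT-5.5 (low)"),
    (39.8, "DeepSeek V4 Flash (Reasoning, High Effort)"),
]
curated = curate(rows)
for score, name in curated:
    print(f"{score:5.1f}  {name}")
```

With these rows, the medium and low GPT-5.5 variants are dropped because two higher-scoring variants of the same model are already kept.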

matheusmoreira 4 hours ago

These results are amazing! I can't believe an open weight model rivals Opus 4.6, my most used model!

cmrdporcupine 8 hours ago

My observations:
Surprised to see MiniMax M3 so low on that list. That's not really my experience; I found it smarter than Gemini for a lot of things, that's for sure.
Also surprised to see Gemini 3.1 ranked that high. It remains IMHO blatantly incompetent at tool use, even in its own harnesses, so I can only assume this benchmark doesn't weight workflow things very heavily. Gemini can write code just fine. It just can't work well as an agent.
GLM 5.2 and Qwen3.7 Max were, in my experience, fairly expensive per token and hard to argue for when the SOTA coding plans have a fixed price that makes them potentially more cost-effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)
DeepSeek is clearly the best value for $$, with the right harness and prompting.

tcp_handshaker 11 hours ago

Short comments...
- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing? It's as if, in some parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
- Certhas 11 hours ago
  
  None of these models come from universities, European or otherwise.
  Mistral is clearly not competing for a frontier model right now. Whether this is due to a lack of VC funds, a lack of technical ability, or the former arising from the latter would be interesting to know.
  The top models are from startups. Among the FAANG, only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there, though.
  So why did no EU-based startups succeed while two US startups did? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.
  
- marcus_cemes 10 hours ago
  
  To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.
  ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.
  It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
  
- wunderlotus 5 hours ago
  
  > - If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
  Yes, if the premise were true, but it’s not.
  https://opper.ai/ai-roundtable/questions/bbf5a4e9-204
  
- kristopolous 11 hours ago
  
  They did Muse Spark ... it's not garbage.
  Also, what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits Facebook's needs perfectly...
  
- applicative 10 hours ago
  
  > China is going to eat the US lunch on AI
  They will forever have superior weights?
  
- ricardobayes 10 hours ago
  
  Well, Europe is famously a laggard when it comes to new tech. In parts of Switzerland, up until 1925, cars were required to have two horses hitched in front to pull them. The UK required a person to walk in front of a car and wave a red flag.
- JKCalhoun 10 hours ago
  
  "…Anthropic Marketeer strike force…"
  Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
  
- cmrdporcupine 8 hours ago
  
  I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing so much as a few other things:
  1. SamA and his company have a well-deserved bad reputation, and Anthropic got some early good PR for basically not being SamA.
  2. Claude Code got early headspace. Boris and crew basically "invented" this kind of agent, so it has first-mover advantage despite its known reliability and cost issues.
  3. Most people I talk to haven't even tried Codex, for some reason.
  Also, it's uncool to complain about downvotes.
- senordevnyc 10 hours ago
  
  I downvoted you for your complaining about downvotes fwiw.
  And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
  As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
bel8 10 hours ago

You left out some models, like DeepSeek and Kimi, for example.
- kristopolous 10 hours ago
  
  It was a truncated output from the script to demonstrate what it does ...
  If you really want to see all of them:
  https://day50.dev/output.txt
  Or run the script
- ashenke 10 hours ago
  
  Because it's not in the top 20 in their benchmark, it's at #23

sosodev 5 hours ago

Note that AA's coding index is made up of only two benchmarks: Terminal-Bench Hard and SciCode. I'm skeptical that it makes a good coding index. It ranks Gemma 4 31B above DeepSeek V4 Flash. Having used both of those models for a broad variety of coding tasks, I would choose DeepSeek every day.

alecco 11 hours ago

Consider using descending score order (best on top).

kristopolous 11 hours ago

Then I'd have to scroll up over 500 lines after running it every time to see what I care about.
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
Add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
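The "any argument flips the order" behaviour can be sketched in Python like this. It is a stand-in for illustration only, not the code in the linked commit; the function name `order_rows` is an assumption, and the rows are taken from the output earlier in the thread.

```python
import sys

def order_rows(rows, argv):
    """Ascending by default; any extra argument at all flips to descending."""
    flipped = len(argv) > 1  # argv[0] is the script name itself
    return sorted(rows, reverse=flipped)

rows = [(59.1, "GPT-5.5 (xhigh)"), (47.1, "Kimi K2.6"), (50.7, "GLM-5.2 (max)")]

if __name__ == "__main__":
    for score, name in order_rows(rows, sys.argv):
        print(f"{score:5.1f}  {name}")
```

Because only the presence of an argument is checked, its value is never read, so any string works as the toggle.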
- datadrivenangel 11 hours ago
  
  Have it print paginated or just top 10?
  
bodhi_mind 10 hours ago

Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.

jarjoura 3 hours ago

Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.

It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.

slig 11 hours ago

Thanks for sharing. I'm curious: why didn't you sort with the score descending?

kristopolous 11 hours ago

Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?
- duckmysick 11 hours ago
  
  I do and that's why I pipe the output to `head -n 20` or use `LIMIT 20` in SQL.
  That aside, this is a good script you're running. Thanks.
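The same "just give me the top 20" idea, sketched in Python for anyone post-processing the scores themselves. This is an illustrative equivalent of sorting descending and piping to `head -n 20`, not part of the script; `top_n` is a made-up helper name and the rows come from the output earlier in the thread.

```python
import heapq

def top_n(rows, n=20):
    """Return the n highest-scoring rows, best first, without a full sort."""
    return heapq.nlargest(n, rows, key=lambda r: r[0])

rows = [
    (47.1, "Kimi K2.6"),
    (62.0, "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)"),
    (50.7, "GLM-5.2 (max)"),
    (59.1, "GPT-5.5 (xhigh)"),
]

best = top_n(rows, n=2)
for score, name in best:
    print(f"{score:5.1f}  {name}")
```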
  
fridder 11 hours ago

Not OP, but if you run this from the CLI, the ordering does make a little more sense.

snsnbsne 11 hours ago

Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.
Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.

drob518 7 hours ago

Maybe your script could sort based on score.

scrollop 9 hours ago

Would be interesting to see where GPT 5.5 Pro extended is.
