Comment by papersail
13 hours ago
score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
Lol thank you for sorting.
Are the scores here normalized such that each point difference is equidistant?
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
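A minimal sketch of that filtering, not the actual script. It assumes rows are (score, name) pairs, that a model "family" is the name with its trailing parenthesised variant stripped, and that the cutoff means "drop anything scoring below the cutoff model's best entry". The helper names `family` and `curate` are made up for illustration. DeepSeek V4 Flash's score isn't in the excerpt above, so the sample uses GPT-5.2 as a stand-in cutoff, with rows copied from the table upthread.

```python
import re
from collections import defaultdict

def family(name: str) -> str:
    """Strip a trailing parenthesised variant: 'GPT-5.5 (xhigh)' -> 'GPT-5.5'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", name)

def curate(rows, cutoff_name, keep=2):
    """Keep the `keep` best entries per model family, at or above the cutoff.

    rows: iterable of (score, name) tuples.
    cutoff_name: family whose best score is the floor (an assumed reading
    of "used X as the cutoff").
    """
    rows = list(rows)
    cutoff = max(score for score, name in rows if family(name) == cutoff_name)
    by_family = defaultdict(list)
    # Walk from highest to lowest score so the first `keep` hits are the best.
    for score, name in sorted(rows, reverse=True):
        if score >= cutoff and len(by_family[family(name)]) < keep:
            by_family[family(name)].append((score, name))
    kept = [row for group in by_family.values() for row in group]
    return sorted(kept, reverse=True)

# Sample rows taken from the table upthread.
rows = [
    (59.1, "GPT-5.5 (xhigh)"),
    (58.5, "GPT-5.5 (high)"),
    (56.2, "GPT-5.5 (medium)"),
    (53.1, "Claude Opus 4.7 (Non-reasoning, High Effort)"),
    (52.5, "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)"),
    (52.1, "GPT-5.5 (low)"),
    (48.7, "GPT-5.2 (xhigh)"),
    (48.6, "GPT-5.5 (Non-reasoning)"),
    (47.8, "Claude Opus 4.5 (Reasoning)"),
]

for score, name in curate(rows, "GPT-5.2"):
    print(f"{score:5.1f}  {name}")
```

With the stand-in cutoff this keeps the two best GPT-5.5 entries, both Opus 4.7 entries, and the GPT-5.2 row itself, and drops everything else.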
These results are amazing! I can't believe an open weight model rivals Opus 4.6, my most used model!
My observations:
Surprised to see MiniMax M3 so low on that list. That's not really my experience; I found it smarter than Gemini for a lot of things, that's for sure.
Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.
GLM 5.2 and Qwen3.7 Max were, in my experience, fairly expensive per token and hard to argue for when the SOTA coding plans have a fixed price that makes them potentially more cost-effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)
DeepSeek is clearly the best value for $$. With the right harness and prompting.
Short comments...
- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
None of these models come from universities, European or otherwise.
Mistral is clearly not competing for a frontier model right now. Whether this is due to a lack of VC funds, a lack of technical ability, or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG, only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there, though.
So why did no EU-based startups succeed while two US startups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.
Mistral has moved to actually trying to make money and has been relatively successful, or at least it would count as successful if we lived in a normal world.
They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...
Apertus was built by universities in Switzerland. Although not frontier it is fully open.
[1] https://apertvs.ai/pages/about/
I'm actually more curious about IBM. Their Granite series appears to be nowhere close to competitive.
They had Watson, remember? It won on Jeopardy! like 15 years ago. They've been at this for a long time.
Maybe it's good at something else?
To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.
ETH Zurich and EPFL recently put out an open model called Apertus (it was on the HN front page a few months back). It's not a frontier model, but they built it properly regarding copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
Sir, I would suggest that if Europe fails to be economically competitive, the downstream implications on European society will produce much worse outcomes than (for instance) data transparency…
Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.
If these models ever reach the point where they are as good a programmer as a human is (and thus can self-improve completely independently), then there won't be an independent Switzerland much longer. The AI race is a race for first place.
> like with the nuclear arms race
MacArthur was about to nuke the Chinese in the Korean war. China knows that nuclear weapons, AI and robotics are a matter of survival and not a nice-to-have.
> - If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
Yes, if the premise were true, but it’s not.
https://opper.ai/ai-roundtable/questions/bbf5a4e9-204
Interesting...but this shows how dumb these AI are.
And they misunderstood "nothing to show for" as...literally nothing to show for. Yes, it's not factually nothing, but he effectively has nothing, or not much, that is competitive to show for it, so it's literally true.
And had they been given this clarification, they would have suddenly said: "Oh yes of course, you are absolutely right, you are correct on challenging me on that...."
They did Muse Spark ... it's not garbage.
Also, what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits Facebook's needs perfectly...
Mo Bitar said something like "Meta's LLM is the one you use if you accidentally hit the wrong button in WhatsApp. Its user base is fat-finger phone users."
> China is going to eat the US lunch on AI
They will forever have superior weights?
I would imagine it will be a fundamental breakthrough, not weights alone, that is going to usher in the next generation of AI. Perhaps China will in fact make that breakthrough. They certainly seem to have a lot of eyeballs in the field right now.
Well, Europe is famously a laggard when it comes to new tech: in parts of Switzerland, two horses were required to be hitched in front to pull cars up until 1925. The UK required a person to walk in front of a car and wave a red flag.
"…Anthropic Marketeer strike force…"
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
“Good will” is easier if OpenAI is your yardstick
I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing so much as a few other things:
1. SamA and his company have a well-deserved bad reputation, and Anthropic got some early good PR for basically not being SamA.
2. Claude Code got early headspace. Boris and crew basically "invented" this kind of agent, and so it has first-mover advantage despite its known reliability and cost issues.
3. Most people I talk to haven't even tried Codex for some reason.
Also it's uncool to complain about downvotes.
I downvoted you for your complaining about downvotes fwiw.
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
You left some models out, like DeepSeek and Kimi.
It was a truncated output from the script to demonstrate what it does ...
If you really want to see all of them:
https://day50.dev/output.txt
Or run the script
Because it's not in the top 20 in their benchmark, it's at #23