Comment by gertlabs
14 hours ago
Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use the best available quantization).
Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).
Agentic tests are still running, check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 still handled them very well, so I'm curious where Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 model was severely benchmaxxed, and was only really interesting in the context of generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.
Overall, the field of open weights models is looking fantastic. A new near-frontier release every week, it seems.
Comprehensive, difficult-to-game benchmarks at https://gertlabs.com/?mode=oneshot_coding
I'm looking at your table now - is there a reason why you don't include cost? If Opus 4.7 is the winner but costs e.g. 5x as much, that's important information.
We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.
I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.
How would K2.6 compare to Sonnet 4.6 both price and performance wise?
In terms of raw token cost, I've seen a couple of providers at $0.95 input / $0.15 cached input / $5 output (all prices per Mtok), vs. $3 input / $15 output for Sonnet.
Task prices will of course be more interesting - a dumber model may use more tokens to get to the same goal.
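To make that concrete, here's a rough back-of-the-envelope sketch. The per-Mtok prices are the ones quoted above; the token counts and the "2x tokens for the cheaper model" assumption are invented purely for illustration, not measured figures.

```python
# Hypothetical per-task cost comparison. Prices are $/Mtok as quoted above;
# token counts below are made-up illustrations, not benchmark data.
K26_PRICES = {"input": 0.95, "output": 5.00}
SONNET_PRICES = {"input": 3.00, "output": 15.00}

def task_cost(prices, input_tokens, output_tokens):
    """Dollar cost of one task given per-Mtok prices and raw token counts."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Assume the cheaper model burns ~2x the tokens to reach the same goal.
print(f"K2.6:   ${task_cost(K26_PRICES, 400_000, 60_000):.2f}")   # ~$0.68
print(f"Sonnet: ${task_cost(SONNET_PRICES, 200_000, 30_000):.2f}")  # ~$1.05
```

Under these made-up numbers the cheaper model still wins even at double the tokens, but the conclusion flips if the token multiplier grows large enough - which is exactly why measured per-task cost matters more than list prices.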
Can you add Qwen 3.6 max to the leaderboard?
We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.
Wait, why compare 2.6 to 2 instead of to 2.5?
Good question. We missed that release entirely. Our automated model checker only went live 2 months ago, so models were manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.
Surprised to see such variance per language
It's interesting; I can only speculate as to the underlying reason. When given enough time in longer agentic tasks, models perform better in Rust/C++ and actually perform worst in Python, at least for tasks that aren't judged on code speed. https://gertlabs.com/?mode=agentic_coding