Comment by jbellis
2 months ago
SWE-bench's bigger problems include (1) labs train on the test set and (2) 50% of the tickets are from Django; it's not a representative dataset even if all you care about is Python.
I created a new benchmark from Java commits made in the past 6 months to add some variety: https://brokk.ai/power-ranking
No GLM?
no, I'm pretty skeptical that it's better than qwen3 coder
but if you have evidence that it could be, I'm down to test it
It has the same score on https://lmarena.ai/leaderboard/webdev, but AFAIK the Air version is much smaller.