Comment by jbellis

8 months ago

SWE-bench's bigger problems include (1) labs training on the test set and (2) 50% of the tickets coming from Django; it's not a representative dataset even if all you care about is Python.

I created a new benchmark from Java commits made in the past 6 months to add some variety: https://brokk.ai/power-ranking

4 comments

lostmsu 8 months ago

No GLM?

jbellis 8 months ago
No, I'm pretty skeptical that it's better than Qwen3 Coder,
but if you have evidence that it could be, I'm down to test it.
- lostmsu 8 months ago
  
  It has the same score on https://lmarena.ai/leaderboard/webdev, but AFAIK the Air version is much smaller.
  