Comment by jareds

2 months ago

I'll look at it when this shows up on https://aider.chat/docs/leaderboards/. Keeping up with all the models feels like a full-time job, so I just use the leaderboard instead and hopefully get 90% of the benefit I would from manually testing out every model.

Are these just LeetCode exercises? What I would like to see is an independent benchmark based on real tasks in codebases of varying size.

  • Aider uses a dataset of 500 GitHub issues, so not LeetCode-style work.

    • It says right on that linked page:

      > Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.

      I looked up Exercism and, unless I'm missing something, they appear to be story problems that you solve by coding from a mostly or entirely blank slate (see the sketch below). That format would seem to explain why the models are reportedly performing so well, because they definitely aren't that reliable on mature codebases.
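
      For reference, here is a minimal sketch of what one of these tasks tends to look like, assuming the classic "leap year" exercise (the exact problem set and function names vary by track; this one is illustrative): a short story-problem spec plus an empty function stub, with no surrounding codebase to navigate.

      ```python
      # Illustrative Exercism-style exercise: "leap year".
      # The model receives a one-paragraph spec and a blank stub like this,
      # then fills in the body so the bundled tests pass.

      def leap_year(year: int) -> bool:
          """Return True if `year` is a leap year in the Gregorian calendar."""
          # Divisible by 4, except century years, unless divisible by 400.
          return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


      if __name__ == "__main__":
          assert leap_year(2024) is True
          assert leap_year(1900) is False  # century year, not divisible by 400
          assert leap_year(2000) is True   # divisible by 400
          print("all checks passed")
      ```

      The whole task fits in one file with a handful of unit tests, which is a very different workload from navigating a large existing repository.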