Comment by rvz

5 months ago

The benchmark for assessing AI models' 'coding' ability should be based on real-world, production-grade codebases and on fixing bugs in them, such as the Linux kernel, Firefox, SQLite, or other large, well-known repositories.

Not HackerRank, LeetCode, or past IOI and IMO problems, which already have known solutions, so the model is just reproducing the most optimal solution copied from someone else.

If it can't manage most unseen coding problems that have no prior solutions, what hope does it have of correctly explaining and fixing bugs in very complex repositories with 1M-10M+ lines of code?

For anyone who doesn't know what IOI and IMO refer to:

IOI refers to the International Olympiad in Informatics, a prestigious annual computer science competition for high school students, while IMO refers to the International Mathematical Olympiad, which is a world-renowned mathematics competition for pre-college students.

(Ironically, provided by ChatGPT)

The new Lancer benchmark is based on actual real-world problems, and that is where these models are failing by a huge margin.