Comment by anotherpaulg

1 year ago

I ran a few experiments by adding 0, 1 or 2 "write better code" prompts to aider's benchmarking harness. I ran a modified version of aider's polyglot coding benchmark [0] with DeepSeek V3.
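For context, the modification amounts to a couple of extra chat turns per task. A rough sketch of the experiment loop, using hypothetical helper names (this is not aider's actual harness code):

    FOLLOWUP = "write better code"

    def attempt_task(solve, task_prompt, num_followups):
        # `solve` is any callable that sends one chat message to the model
        # and applies its proposed edits to the working repo (hypothetical).
        solve(task_prompt)               # initial solution attempt
        for _ in range(num_followups):   # 0 (baseline), 1, or 2
            solve(FOLLOWUP)              # blind "improve it" nudge
        # the exercise's test suite is run against the final code afterwards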

Here are the results:

  Score | Number of "write better code" followup prompts
  ------+-----------------------------------------------
  27.6% | 0 (baseline)
  19.6% | 1
  11.1% | 2

It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.

[0] https://aider.chat/docs/leaderboards/

This is an interesting result, but not too surprising: each blind rewrite is another chance to introduce bugs that make the test suite fail.

To be fair, you didn’t specify that the functional requirements should be maintained; you only asked for better code. ;)