
Comment by spartanatreyu

5 months ago

Link to the original paper: https://arxiv.org/pdf/2502.12115

TL;DR:

They tested LLMs on programming tasks and on manager tasks.

The vast majority of the tasks involve bugfixes.

Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.

The LLMs have a higher probability of passing the tests when given more attempts, but there isn't much data showing where the improvement tails off (probably because running the tests is expensive).
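The "more attempts → higher pass rate" relationship is usually reported as a pass@k metric. As a sketch only (the paper may compute this differently), the standard unbiased pass@k estimator from the literature looks like this, where `n` is the total number of sampled attempts and `c` is how many of them passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k attempts, drawn without replacement from n total attempts
    (c of which passed), is a passing one."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 passing attempts out of 10, a single try passes 20% of the
# time, while allowing 5 tries pushes the probability much higher --
# which is why curves like these climb with k before tailing off.
print(pass_at_k(10, 2, 1))  # 0.2
print(pass_at_k(10, 2, 5))
```

Note that pass@k only measures whether *some* attempt passes; it says nothing about the review burden of sorting the passing attempt out from the failures, which is what the concerns below are about.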

Personally, I have other concerns:

- A human asked to review repeated LLM attempts at the same problem will review less thoroughly after the first few attempts, and over time will let false positives slip through.

- An LLM asked to review repeated LLM attempts at the same problem will tend to convince itself that the fix is correct, regardless of whether it actually is.

- LLM use increases code churn in a codebase.

- Increased code churn is known to be bad for the health of a project.