Comment by freetime2

2 days ago

Here is the methodology of the study:

> To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years. Developers provide lists of real issues (246 total) that would be valuable to the repository—bug fixes, features, and refactors that would normally be part of their regular work. Then, we randomly assign each issue to either allow or disallow use of AI while working on the issue. When AI is allowed, developers can use any tools they choose (primarily Cursor Pro with Claude 3.5/3.7 Sonnet—frontier models at the time of the study); when disallowed, they work without generative AI assistance. Developers complete these tasks (which average two hours each) while recording their screens, then self-report the total implementation time they needed. We pay developers $150/hr as compensation for their participation in the study.

So it's a small sample size of 16 developers. And it sounds like different tasks were (randomly) assigned to the no-AI and with-AI groups - so the control group doesn't have the same tasks as the experimental group. I think this could lead to some pretty noisy data.
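
For concreteness, the design described in the quote is per-issue randomization within each developer, i.e. roughly the sketch below (the function and issue names are made up for illustration, not taken from the study):

```python
import random

def assign_conditions(issues, seed=0):
    """Randomly assign each of one developer's issues to a study arm.

    Because randomization happens per issue, a developer's AI-allowed and
    AI-disallowed tasks are different issues - which is where the extra
    noise comes from.
    """
    rng = random.Random(seed)
    return {issue: rng.choice(["ai_allowed", "ai_disallowed"]) for issue in issues}

# Hypothetical issue list for one developer
print(assign_conditions(["fix-race-in-scheduler", "add-csv-export", "refactor-parser"]))
```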

Interestingly, small sample size isn't in the list of objections that the author includes under "Addressing Every Objection You Thought Of, And Some You Didn’t".

I do think it's an interesting study. But I'd want to see the results reproduced before reading too much into it.

I think the productivity gains most people rave about come from cases like: I wanted to do X, which isn't hard if you're experienced with library Y, library Y is pretty popular, and the LLM did it perfectly first try!

I think that's where you get 10-20x. When you're working on niche stuff it's either not gonna work or work poorly.

For example, right now I need to figure out why an ffmpeg filter doesn't do X thing smoothly, even though the C code for the filter is tiny and self-contained. Gemini refuses to add comments to the code. It just apologizes for not being able to add comments to 150 lines of code lol.

However, for building an ffmpeg pipeline in Python I was dumbfounded by how fast I was prototyping stuff and building fairly complex filter chains. If I'd had to do that by hand, just by reading the docs, it would've taken me a whole lot more time, effort and frustration - but it was a joy to figure out with Gemini.
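
To give a flavour of what that looked like, here's roughly the shape of such a pipeline using the ffmpeg-python bindings (the filenames and filter parameters are placeholders I made up, not my actual chain):

```python
import ffmpeg  # pip install ffmpeg-python

main = ffmpeg.input("input.mp4")
logo = ffmpeg.input("logo.png")

# Scale, normalise the frame rate, then overlay a watermark in the corner
video = (
    main.video
    .filter("scale", 1280, 720)
    .filter("fps", fps=30)
    .overlay(logo, x=10, y=10)
)

# Re-attach the original audio and encode
ffmpeg.output(video, main.audio, "output.mp4", vcodec="libx264", crf=23).run()
```

The nice part is that the chain composes like ordinary Python calls instead of one giant hand-written -filter_complex string.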

So going back to the study: IMO it's flawed, because by definition working on new features for open-source projects wouldn't be the bread and butter of LLMs. But most people aren't working on stuff like that - they're rewriting the same code that 10000 other people have written, just with their own tiny little twist or whatever.

  • I really think they excel at greenfield work, and are “fine” at writing code for existing systems. When you are unfamiliar with a library or a pattern it’s a huge time saver.

The sample size isn't 16 developers, it's 246 issues.

  • So I agree with that - but on the other hand, surely the number of developers matters here? For example, if instead of 16 developers the study had consisted of a single developer completing all 246 tasks with or without AI, and comparing the observed completion times, I think most people would question the reproducibility and relevance of the study?

    • It matters in the sense that it's unclear whether the findings generalise to other people - which is a problem that a lot of studies have, even with more participants, because they may not have a diverse enough set of participants.

      But in terms of pure statistical validity, I don't think it matters.

  • Whilst my recent experience possibly agrees with the findings, I came here to moan about the methods. Whether it's 16 or 246, that's still a miserably small sample size.

  • Okay, so why not 246,000 issues?

    • If you read through the methodology, including how they paid the participants $150/hr for 20-40 hours of work per participant, you can probably hazard a guess why they didn't scale up the size of the study by 1000x.
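
      Rough back-of-the-envelope with the numbers from the quoted methodology (16 developers, 20-40 hours each at $150/hr; the 30-hour midpoint below is my assumption):

      ```python
      # Approximate participant compensation for the study as described
      developers = 16
      hours_each = 30            # midpoint of the stated 20-40 hours
      rate_usd_per_hour = 150

      study_cost = developers * hours_each * rate_usd_per_hour
      print(f"~${study_cost:,}")         # ~$72,000
      print(f"~${study_cost * 1000:,}")  # ~$72,000,000 for 1000x the issues
      ```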