Comment by jdlshore
7 days ago
This is self-reported productivity, in that devs are saying AI saves them about 4 hours per week. But let’s not forget the METR study that found a 20% increase in self-reported productivity but a 19% decrease in actual measured productivity.
(It used a clever and rigorous technique for measuring productivity differences, BTW, for anyone as skeptical of productivity measures as I am.)
Let's also not forget the multiple other studies that found significant boosts to productivity using rigorous methods like RCTs.
However, because these threads always go the same way whenever I post this, I'll link to a previous thread in hopes of preempting the same comments and advancing the discussion! https://getdx.com/uploads/ai-measurement-framework.pdf
It's not clear from TFA if these savings are self-reported or from DX metrics.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
That info is from mid-2025 and covers models released in October 2024 and February 2025. It predates tools like Claude Code and Codex, Lovable was at roughly a third of its current ARR, etc.
This might still be true but we desperately need new data.
None of those changes address the issue jdlshore is pointing out: self-assessed developer productivity increases from LLMs are not a reliable indication of actual productivity increases. It's true that modern LLMs might have less of a negative impact on productivity, or even increase it, but you won't be able to tell by asking developers whether they feel more productive.
(Also, Anthropic released Claude Code in February of 2025, which was near the start of the period the study ran.)
> self-assessed developer productivity increases from LLMs are not a reliable indication of actual productivity increases.
I believe the other direction makes more sense; if the studies disagree with self-reported information, it's more likely the studies are wrong. At the very least, it's worth heavily questioning whether the studies are wrong.
Yeah, new data would be great, but I feel like these tools are not substantively better, and this is becoming the new "it's different this time!"
Counting "time per PR" is as useless as counting lines of code.
As an industry, I think we spend ~10% of our time writing code and ~90% of our time maintaining it and building upon it.
The real metric is not "how long did that PR take" but "how much additional work will this PR create or save in the long run" -- i.e., did this create tech debt? Or did it actually save us a bunch of effort in the long run?
My experience with ChatGPT over these last few years is that, used "conscientiously," it lets me ship much higher-quality code because it has been very good at finding edge cases and suggesting optimizations. I am quite certain that, viewed over the long haul, it has been at least a 2x productivity gain, possibly much more, because all those edge cases and perf issues it solved for me in the initial PR represent many hours of work that will never have to be performed in the future.
It is of course possible to use AI coding assistants in other ways, producing AI slop that passes tests but is poorly structured and understood.
Has the METR study been replicated?
Not a scientific study, but someone did replicate the experiment on themselves [0] and found that, in their case, any effect from LLM use wasn't detectable in their sample. Notably, they almost certainly had more experience with LLMs than most of the METR participants did.
[0] https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
I haven’t heard about any similar studies, no. I’m planning to conduct one at my workplace but we’re still deciding exactly which uses of AI to test.