Comment by narush
3 days ago
Hey Simon -- thanks for the detailed read of the paper -- I'm a big fan of your open-source projects!
Noting a few important points here:
1. Some prior studies that find speedup do so with developers who have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, prompting was the only experience-related concern that most external reviewers raised, as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but because the without-AI baseline got worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, because these results are surprising, it's easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains the slowdown." My guess: there is no one factor -- a bunch of factors contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
Figure 6, which breaks down the time spent on different tasks, is very informative -- it suggests:
- 15% less active coding
- 5% less testing
- 8% less research and reading
- 4% more idle time
- 20% more AI interaction time
The 28% less coding/testing/research is why developers reported ~20% less work. You might be spending 20% more time overall "working" while actually being idle 5% more of the time, and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
I think the AI skill boost comes from having workflows that let you shave off half that git-ops time, cut an extra 5% off coding, and trade the idle/waiting for more prompting of parallel agents and a bit more testing -- then you really are a 2x dev.
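As a quick sanity check on the arithmetic above, here's a minimal sketch in Python. The numbers are the approximate percentage-point deltas quoted from Figure 6 in this thread, not official values from the paper:

```python
# Approximate percentage-point changes in share of total time, as read off
# Figure 6 in the comment above (illustrative readings, not exact paper values).
deltas = {
    "active coding": -15,
    "testing": -5,
    "research/reading": -8,
    "idle": +4,
    "AI interaction": +20,
}

hands_on = deltas["active coding"] + deltas["testing"] + deltas["research/reading"]
print(f"hands-on work: {hands_on} points")  # -28 points of "work that feels like work"

# Shares must still sum to 100%, so categories not listed here
# (git ops, environment setup, etc.) net out the remainder.
print(f"remaining categories: {-sum(deltas.values()):+d} points")  # +4 points
```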
> You might be spending 20% more time overall "working" while actually being idle 5% more of the time, and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
This is going to be interesting long-term. Realistically, people don't spend anywhere close to 100% of their time working, and they take breaks after intense periods of work. So the real benefit calculation needs to include: the outcome itself, time spent interacting with the tool, overlap of tasks while agents are running, time spent doing work over a long period, any skill degradation, LLM skills, etc. It's going to take a long time before we have real answers to most of those, much less their interactions.
I just realized the figure shows the time breakdown as a percentage of total time. It would be more useful to show absolute time (hours) for those side-by-side comparisons, since the implied hours would boost the height of the AI bars by ~18%.
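To illustrate that share-vs-absolute point, here is a minimal sketch under stated assumptions: a hypothetical 2-hour baseline issue, an assumed ~1.18x longer total time with AI (the factor the comment above implies), and placeholder category shares chosen to roughly match the deltas quoted earlier -- none of these are Figure 6's actual values:

```python
# Convert share-of-total-time bars into absolute hours (all numbers illustrative).
baseline_hours = 2.0  # assumed average time per issue without AI
slowdown = 1.18       # assumed overall slowdown factor with AI
ai_hours = baseline_hours * slowdown

# Placeholder shares (fractions of total time in each condition); each sums to 1.0.
shares_no_ai = {"coding": 0.45, "testing": 0.20, "research": 0.15, "idle": 0.10, "other": 0.10}
shares_ai = {"coding": 0.30, "testing": 0.15, "research": 0.07, "idle": 0.14,
             "AI interaction": 0.20, "other": 0.14}

for category in shares_ai:
    without = shares_no_ai.get(category, 0.0) * baseline_hours
    with_ai = shares_ai[category] * ai_hours
    print(f"{category:>15}: {without:.2f}h without AI vs {with_ai:.2f}h with AI")

# A category whose *share* is unchanged still takes ~18% more absolute time,
# which side-by-side percentage bars hide.
```

For example, under these assumptions an "idle" share rising from 10% to 14% is a ~65% jump in hours, not the ~40% jump the percentage bars suggest.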
There's an additional per-minute breakdown in the appendix -- see appendix E.4!
Thanks for the detailed reply! I need to spend a bunch more time with this, I think -- the above were initial hunches from skimming the paper.
Sounds great. Looking forward to hearing more detailed thoughts -- my email's in the paper :)
Really interesting paper, and thanks for the follow-on points.
The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.
Were participants given time to customize their Cursor settings? In my experience, tool/convention mismatch kills Cursor's productivity -- once it gets going with the wrong library or doesn't use the project's functions, I'll almost always reject the code and re-prompt. But, especially for large projects, having a well-crafted repo prompt mitigates most of these issues.
Given today's state of LLMs and agents, they're still not good for every task. It took me a couple of weeks to correctly adjust my sense of what I can ask for and what I can expect. As a result, I don't use Claude Code for everything, and I think I'm now better at picking the right task -- and the right size of task -- to give it. These adjustments depend on what you're doing, and on the complexity and maturity of the project at hand.
Very often, I have entire tasks that I can't offload to the agent. I won't say I'm 20x more productive; it's probably more in the range of 15% to 20% (though obviously I can't measure that).
Using devs working in their own repositories is certainly understandable, but it might also partly explain the results. Personally, I barely use AI for my own code; on the other hand, when working on a one-off script or an unfamiliar codebase, I get a lot more value from it.
Your next study should be very experienced devs working in new or early-life repos, where AI shines at refactoring and structured code suggestions, not to mention documentation and tests.
It's much more useful for getting something off the ground than for maintaining a huge codebase.
Did each developer do a large enough mix of AI/non-AI tasks, in varying orders, that your data gives any hints about whether the "AI penalty" grew or shrank over time?
You can see this analysis in the factor analysis of "Below-average use of AI tools" (C.2.7) in the paper [1], which we mark as an unclear effect.
TL;DR: over the first 8 issues, developers do not appear to become substantially less slowed down.
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Thanks, that's great!
But: if developers did 136 AI-assisted issues in total, why only analyze excluding the first 8, rather than, say, the first 68 (half)?