Measuring the impact of AI on experienced open-source developer productivity

2 days ago (metr.org)

Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.

A quarter of the participants saw increased performance, 3/4 saw reduced performance.

One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:

> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

  • I find the very popular response of "you're just not using it right" to be a big cop-out for LLMs, especially at the scale we see today. It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user. Typically, if a user doesn't find value in the product, we agree that the product is poorly designed/implemented, not that the user is bad. But AI seems somehow exempt from this sentiment.

    • > It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.

      It's completely normal in development. How many years of programming experience do you need for almost any language? How many days or weeks do you need to use debuggers effectively? How long is it from first contact with version control until you really get git?

      I think it's the opposite actually - it's common that new classes of tools in tech need experience to use well. Much less if you're moving to something different within the same class.

      23 replies →

    • >It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.

      Is that perhaps because of the nature of the category of 'tech product'? In other domains, this certainly isn't the case -- especially if the goal is to get the best result instead of the optimum output/effort balance.

      Musical instruments are a clear case where the best results are down to the user. Most crafts are similar. There is the proverb "A bad craftsman blames his tools" that highlights that there are entire fields where the skill of the user is considered to be the most important thing.

      When a product is aimed at as many people as the marketers can find, that focus on individual ability is lost and the product targets the lowest common denominator.

      They are easier to use, but less capable at their peak. I think of the state of LLMs as analogous to home computing at a stage of development somewhere around the Altair to TRS-80 level. These are the first ones on the scene, people are exploring what they are good for, how they work, and sometimes putting them to effective use in new and interesting ways. It's not unreasonable to expect a degree of expertise at this stage.

      The LLM equivalent of a Mac will come, plenty of people will attempt to make one before it's ready. There will be a few Apple Newtons along the way that will lead people to say the entire notion was foolhardy. Then someone will make it work. That's when you can expect to use something without expertise. We're not there yet.

    • > It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.

      Maybe, but it isn't hard to think of developer tools where this is the case. This is the entire history of editor and IDE wars.

      Imagine running this same study design with vim. How well would you expect the not-previously-experienced developers to perform in such a study?

      12 replies →

    • New technologies that require new ways of thinking are always this way. "Google-fu" was literally a hirable career skill in 2004 because nobody knew how to search to get optimal outcomes. They've done alright improving things since then - let's see how good Cursor is in 10 years.

    • I think the reason for that is maybe you’re comparing to traditional products that are deterministic or have specific features that add value?

      If my phone keeps crashing or if the browser is slow or clunky then yes, it’s not on me, it’s the phone, but an LLM is a lot more open ended in what it can do. Unlike the phone example above where I expect it to work from a simple input (turning it on) or action (open browser, punch in a url), what an LLM does is more complex and nuanced.

      Even the same prompt from different users might result in different output - so there is more onus on the user to craft the right input.

      Perhaps that’s why AI is exempt for now.

    • Stay tuned, a new study is coming with another revelation: you aren't getting faster by using Vim when you are learning it.

      My previous employer didn't even allow me to use Vim until I learned it properly so it wouldn't affect my productivity. Why would using Cursor automatically make you better at something when it's brand new to you, even if you are already an elite programmer according to this study?

      1 reply →

    • It's a specialist tool. You wouldn't be surprised that it takes a while for someone to get good at typed programming, parallel programming, Docker, IaC, etc. either.

      We have 2 sibling teams, one the genAI devs and the other the regular GPU product devs. It is entirely unsurprising to me that the genAI developers are successfully using coding agents with long-running plans, while the GPU developers are still more at the level of chat-style back-and-forth.

      At the same time, everyone sees the potential, and just like other automation movements, are investing in themselves and the code base.

    • On the other hand if you don't use vim, emacs, and other spawns from hell, you get labeled a noob and nothing can ever be said about their terrible UX.

      I think we can be more open minded that an absolutely brand new technology (literally did not exist 3y ago) might require some amount of learning and adjusting, even for people who see themselves as an Einstein if only they wished to apply themselves.

      3 replies →

    • Not every tool can be figured out in a day (or a week or more). That doesn't mean that the tool is useless, or that the user is incapable.

    • I've spent the last 2 months trying to figure out how to utilize AI properly, and only in the last week do I feel that I've hit upon a workflow that's actually a force multiplier (vs divisor).

      1 reply →

    • > It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.

      Sorry to be pedantic but this is really common in tech products: vim, emacs, any second-brain app, effectiveness of IDEs depending on learning its features, git, and more.

      1 reply →

    • Just a few examples: Bicycle. Car (driving). Airplane (piloting). Welder. CNC machine. CAD.

      All take quite an effort to master, until then they might slow one down or outright kill.

  • Hey Simon -- thanks for the detailed read of the paper - I'm a big fan of your OS projects!

    Noting a few important points here:

    1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.

    2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found the slowdown, the only experience-related concern most external reviewers raised was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.

    3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (but not because AI was better, but just because with AI is much worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!

    4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.

    5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.

    In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).

    I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!

    (You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)

    • Figure 6, which breaks down how time was spent across different activities, is very informative. It suggests: 15% less active coding, 5% less testing, 8% less research and reading, 4% more idle time, and 20% more AI interaction time.

      The 28% less coding/testing/research is why developers reported 20% less work. You might be spending 20% more time overall "working" while actually being idle 5% more, and feel like you've worked less because you were drinking coffee and eating a sandwich between prompting the AI and reading its output.

      I think the AI skill boost comes from having workflows that let you shave off half of that git-ops time, cut an extra 5% off coding, and trade the idle/waiting for prompting parallel agents and a bit more testing -- then you really are a 2x dev.

      3 replies →

    • Thanks for the detailed reply! I need to spend a bunch more time with this I think - above was initial hunches from skimming the paper.

      1 reply →

    • Really interesting paper, and thanks for the followon points.

      The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.

    • With today's state of LLMs and agents, they're still not good for all tasks. It took me a couple of weeks before being able to correctly adjust what I can ask and what I can expect. As a result, I don't use Claude Code for everything, and I think I'm able to better pick the right task and the right size of task to give it. These adjustments depend on what you are doing, and on the complexity and maturity of the project at play.

      Very often, I have entire tasks that I can't offload to the Agent. I won't say I'm 20x more productive, it's probably more in the range of 15% to 20% (but I can't measure that obviously).

    • Were participants given time to customize their Cursor settings? In my experience tool/convention mismatch kills Cursor's productivity - once it gets going with a wrong library or doesn't use project's functions I will almost always reject code and re-prompt. But, especially for large projects, having a well-crafted repo prompt mitigates most of these issues.
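
      For readers who haven't used Cursor, a "repo prompt" here just means a project rules file that the editor prepends to every request. A hypothetical sketch of what such a file might contain (Cursor reads project rules from a `.cursorrules`-style file; everything below is invented for illustration, not taken from the study):

      ```
      # .cursorrules (hypothetical example)
      - Python 3.11, FastAPI and pytest; do not introduce new dependencies without asking.
      - Reuse the helpers in app/utils/ instead of writing new ones.
      - All code must pass ruff and mypy; match the existing formatting.
      - Tests live next to the module they cover and use the fixtures in tests/conftest.py.
      ```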

    • Using devs working in their own repository is certainly understandable, but it might also explain in part the results. Personally I barely use AI for my own code, while on the other hand when working on some one off script or unfamiliar code base, I get a lot more value from it.

    • Your next study should be very experienced devs working in new or early life repos where AI shines for refactoring and structured code suggestion, not to mention documentation and tests.

      It’s much more useful getting something off the ground than maintaining a huge codebase.

    • Did each developer do a large enough mix of AI/non-AI tasks, in varying orders, that you have any hints in your data whether the "AI penalty" grew or shrunk over time?

      3 replies →

  • > My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

    Could be the case for some, but I also think that there is not much to climb on the learning curve for AI agents.

    In my opinion, it's more interesting that the study also states that AI capabilities may be comparatively lower on existing code:

    > Our results also suggest that AI capabilities may be comparatively lower in settings with very high quality standards, or with many implicit requirements (e.g. relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn.

    This is consistent with my personal/peer experience. On existing code, you have to do trial and error with AI until you get a 'good' result, or heavily modify the AI-generated code yourself (which is often slower than writing it yourself from the beginning).

  • Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:

    1. LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).

    2. Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.

    • Let me bring you a third (not necessarily true) interpretation:

      The developer who has experience using Cursor saw a productivity increase not because he became better at using Cursor, but because he became worse at not using it.

      8 replies →

    • > Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.

      I would argue you don't need the "as a programming assistant" phrase as right now from my experience over the past 2 years, literally every single AI tool is massively oversold as to its utility. I've literally not seen a single one that delivers on what it's billed as capable of.

      They're useful, but right now they need a lot of handholding and I don't have time for that. Too much fact checking. If I want a tool I always have to double check, I was born with a memory so I'm already good there. I don't want to have to fact check my fact checker.

      LLMs are great at small tasks. The larger the single task is, or the more tasks you try to cram into one session, the worse they fall apart.

      1 reply →

    • > Current LLMs

      One thing that happened here is that they aren't using current LLMs:

      > Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.

      That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.

      62 replies →

    • The third option is that the person who used Cursor before had some sort of skill atrophy that led to lower unassisted speed.

      I think an easy measure to help identify why a slow down is happening would be to measure how much refactoring happened on the AI generated code. Often times it seems to be missing stuff like error handling, or adds in unnecessary stuff. Of course this assumes it even had a working solution in the first place.

    • > people consistently predict and self-report in the wrong direction

      I recall an adage about work estimation: as chunks get too big, people unconsciously answer "how long will the work take?" by substituting the easier question "how feasible does the final outcome feel?"

      People asked "how long did it take" could be substituting something else, such as "how alone did I feel while working on it."

      3 replies →

    • Or a sampling artifact. 4 vs 12 does seem significant within a study, but consider a set of N such studies.

      I assume that many large companies have tested efficiency gains and losses of their programmers much more extensively than the authors of this tiny study.

      A survey of companies and their evaluations and conclusions would carry more weight -- excluding companies selling AI products, of course.

      1 reply →

  • > My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

    I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.

    A developer gets better at the code they're working on over time. An LLM gets worse.

    You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.

    So a really difficult skill in my mind is continually avoiding temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two month's worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code and you can't do that without putting in work yourself.

    • > Take a whole week to do a month's worth of features

      Everything else in your post is so reasonable and then you still somehow ended up suggesting that LLMs should be quadrupling our output

      6 replies →

    • > So a really difficult skill in my mind is continually avoiding temptation to vibe.

      I agree. I have found that I can use agents most effectively by letting them write code in small steps. After each step I review the changes and polish them up (either by doing the fixups myself or by prompting). I have found that this helps me understand the code, and also avoids the model getting into a bad solution space or producing unmaintainable code.

      I also think this kind of closed loop is necessary. Like yesterday I let an LLM write a relatively complex data structure. It got the implementation nearly correct, but was stuck, unable to find an off-by-one comparison. In this case it was easy to catch because I let it write property-based tests (which I had to fix up to work properly), but it's easy for things to slip through the cracks if you don't review carefully.

      (This is all using Cursor + Claude 4.)
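
      To make the testing part concrete, here is a minimal sketch of the kind of property-based test being described, written in Python with the `hypothesis` library (an assumption -- the commenter doesn't say which language or framework they actually used):

      ```python
      # Property-based testing: generate many random inputs and assert invariants
      # that must always hold, instead of hand-picking a few examples.
      from hypothesis import given, strategies as st

      def insert_sorted(xs, v):
          """Toy stand-in for LLM-written code: insert v into an already-sorted list."""
          i = 0
          while i < len(xs) and xs[i] < v:   # an off-by-one here (e.g. len(xs) - 1) is caught below
              i += 1
          return xs[:i] + [v] + xs[i:]

      @given(st.lists(st.integers()), st.integers())
      def test_insert_keeps_list_sorted(xs, v):
          xs = sorted(xs)
          result = insert_sorted(xs, v)
          assert result == sorted(xs + [v])   # same elements, still sorted
          assert len(result) == len(xs) + 1   # exactly one element added
      ```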

    • I feel the same way. I use it for super small chunks, still understand everything it outputs, and often manually copy/paste or straight up write it myself. I don't know if I'm actually faster than before, but it feels more comfy than alt-tabbing to Stack Overflow, which is what I feel like it's mostly replaced.

      Poor stack overflow, it looks like they are the ones really hurting from all this.

    • > but then hit a brick wall

      This is my intuition as well. I had a teammate use a pretty good analogy today. He likened vibe coding to vacuuming up a string in four tries when it only takes one try to reach down and pick it up. I thought that aligned well with my experience with LLM assisted coding. We have to vacuum the floor while exercising the "difficult skill [of] continually avoiding temptation to vibe"

  • I notice that some people have become more productive thanks to AI tools, while others have not.

    My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.

    Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.

    A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.

    • Just to thank you for that point. I think it's likely more true than most of us realise. That and maybe the ability to mentally scaffold or outline a system or solution ahead of time.

    • An interesting point. I wonder how much my decades-old habit of watching subtitled anime helps there—it’s definitely made me dramatically faster at scanning text.

    • One has to take time to review code and think through different aspects of execution (like memory management, concurrency, etc). Plenty of code cannot be scanned.

      That said, if the language has GC and other helpers, it makes it easier to scan.

      Code and architecture review is an important part of my role and I catch issues that others miss because I spend more time. I did use AI for review (GPT 4.1), but only as an addition, since not reliable enough.

  • We have heard variations of that narrative for at least a year now. It is not hard to use these chatbots and no one who was very productive in open source before "AI" has any higher output now.

    Most people who subscribe to that narrative have some connection to "AI" money, but there might be some misguided believers as well.

  • > My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

    This is what I heard about strong type systems (especially Haskell's) about 15-20 years ago.

    "History does not repeat, but it rhymes."

    If we rhyme "strong types will change the world" with "agentic LLMs will change the world," what do we get?

    My personal theory is that we will get the same: some people will get modest-to-substantial benefits there, but changes in the world will be small if noticeable at all.

    • I don't think that's a fair comparison. Type systems don't produce probabilistic output. Their entire purpose is to reduce the scope of possible errors you can write. They kind of did change the world, didn't they? I mean, not everyone is writing Haskell but Rust exists and it's doing pretty well. There was also not really a case to be made where type systems made software in general _worse_. But you could definitely make the case that LLM's might make software worse.

      2 replies →

    • Maybe it depends on the task. I'm 100% sure that if you think a type system is a drawback, then you have never coded in a diverse, large codebase. Our 1.5 million LOC, 30-year-old monolith would be completely unmaintainable without it. But seriously, anything above 10 LOC without a formal type system is unmaintainable after a few years. An informal one is fine for a while, but not for long. In a 30-year-old codebase, basically every informal rule has been broken.

      Also, my long experience is that even in PoC phase, using a type system adds almost zero extra time… of course if you know the type system, which should be trivial in any case after you’ve seen a few.

      3 replies →

  • I'm the developer of txtai, a fairly popular open-source project. I don't use any AI-generated code and it's not integrated into my workflows at the moment.

    AI has a lot of potential but it's way over-hyped right now. Listen to the people on the ground who are doing real work and building real projects: none of them are over-hyping it. The over-hyping comes mostly from those who have only tangentially used LLMs.

    It's also not surprising that many in this thread are clinging to a basic premise that it's 3 steps backwards to go 5 steps forward. Perhaps that is true but I'll take the study at face value, it seems very plausible to me.

  • My personal experience was that of a decrease in productivity until I spent significant time with it. Managing configurations, prompting it the right way, asking other models for code reviews… And I still see there is more I can unlock with more time learning the right interaction patterns.

    For nasty, legacy codebases there is only so much you can do IMO. With green field (in certain domains), I become more confident every day that coding will be reduced to an AI task. I’m learning how to be a product manager / ideas guy in response

  • Looking at the example tasks in the pdf ("Sentencize wrongly splits sentence with multiple...") these look like really discrete and well defined bug fixes. AI should smash tasks like that so this is even less hopeful.

  • I'm sympathetic to the argument re experience with the tools paying off, because my personal anecdata matches that. It hasn't been until the last 6 weeks, after watching a friend demo their workflow, that my personal efficiency has improved dramatically.

    The most useful thing of all would have been to have screen recordings of those 16 developers working on their assigned issues, so they could be reviewed for varying approaches to AI-assisted dev, and we could be done with this absurd debate once and for all.

  • I don't even think we know how to do it yet. I revise my whole attitude and all of my beliefs about this stuff every week: I figure out things that seemed really promising don't pan out, I find stuff that I kick myself for not realizing sooner, and it's still this high-stakes game. I still blow a couple of days and wish I had just done it the old-fashioned way, and then I'll catch a run where it's like, fuck, I was never that good, that's the last 5-10% that breaks a PB.

    I very much think that these things are going to wind up being massive amplifiers for people who were already extremely sophisticated and then put massive effort into optimizing them and combining them with other advanced techniques (formal methods, top-to-bottom performance orientation).

    I don't think this stuff is going to democratize software engineering at all, I think it's going to take the difficulty level so high that it's like back when Dijkstra or Tony Hoare was a fairly typical computer programmer.

  • > My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

    Definitely. Effective LLM usage is not as straightforward as people believe. Two big things I see a lot of developers do when they share chats:

    1. Talk to the LLM like a human. Remember when internet search first came out, and people were literally "Asking Jeeves" in full natural language? Eventually people learned that you don't need to type, "What is the current weather in San Francisco?" because "san francisco weather" gave you the same, or better, results. Now we've come full circle and people talk to LLMs like humans again; not out of any advanced prompt engineering, but just because it's so anthropomorphized it feels natural. But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?" (a short example of the answer either phrasing should produce is at the end of this comment). The LLM is also not insulted by you talking to it like this.

    2. Don't know when to stop using the LLM. Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually", they'll keep trying to prompt to get the LLM to generate what they want. Sometimes this works, but often it's just a waste of time and it's far more efficient to just take the LLM output and adjust it manually.

    Much like so-called Google-fu, LLM usage is a skill and people who don't know what they're doing are going to get substandard results.
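
    As a small illustration of point 1, here is the kind of one-liner either phrasing of the pandas prompt should produce (a trivial sketch; `df` and the column name 'Foo' are placeholders):

    ```python
    import pandas as pd

    df = pd.DataFrame({"Foo": ["a", "b", "a", "c", "b"]})

    # What either the terse or the fully-worded prompt should give you:
    unique_count = df["Foo"].nunique()   # -> 3
    ```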

    • > Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually"

      IMO 80% is way too much. LLMs are probably good for things that are not your domain knowledge and where you can afford to not be 100% correct -- like rendering the Mandelbrot set and simple functions like that (see the sketch at the end of this comment).

      LLMs are not deterministic: sometimes they produce correct code and other times they produce wrong code. This means one has to audit LLM-generated code, and auditing code takes more effort than writing it, especially if you are not the original author of the code being audited.

      Code has to be 100% deterministic. As programmers we write code -- detailed instructions for the computer (CPU) -- and we have developed a lot of tools, such as unit tests, to make sure the computer does exactly what we wrote.

      A codebase has a lot of context that you gain by writing the code: some things just look wrong, and you know exactly why, because you wrote the code. There is also a lot of context that you should keep in your head as you write the code -- context that you miss by simply prompting an LLM.
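
      As a concrete illustration of the Mandelbrot case mentioned above -- small, self-contained, low-stakes, and easy to check by eye -- here is a minimal Python sketch of that kind of throwaway function (written for illustration; it is not from the study or from the commenter):

      ```python
      def mandelbrot_ascii(width=80, height=24, max_iter=50):
          """Render a rough ASCII Mandelbrot set -- a small, easily verified, low-stakes task."""
          rows = []
          for y in range(height):
              row = ""
              for x in range(width):
                  # Map the character cell to a point c in the complex plane.
                  c = complex(-2.5 + 3.5 * x / width, -1.25 + 2.5 * y / height)
                  z = 0j
                  for _ in range(max_iter):
                      z = z * z + c
                      if abs(z) > 2:      # escaped, so definitely outside the set
                          break
                  row += "*" if abs(z) <= 2 else " "
              rows.append(row)
          return "\n".join(rows)

      print(mandelbrot_ascii())
      ```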

    • > Effective LLM usage is not as straightforward as people believe

      It is not as straightforward as people are told to believe!

      1 reply →

    • "But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?""

      How can you be so sure? Did you compare in a systematic way or read papers by people who did it?

      I do get results when giving the LLM only snippets and keywords, but for anything complex I notice differences depending on how I articulate the prompt. I'm not claiming there is a significant difference, but that's how it seems to me.

      5 replies →

    • > But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?"

      While the results are going to be similar, typing a question in full can help you think about it yourself too, as if the LLM is a rubber duck that can respond back.

      I've found myself adjusting and rewriting prompts during the process of writing them before i ask the LLM anything because as i was writing the prompt i was thinking about the problem simultaneously.

      Of course for simple queries like "write me a function in C that calculates the length of a 3d vector using vec3 for type" you can write it like "c function vec3 length 3d" or something like that instead and the LLM will give more or less the same response (tried it with Devstral).

      But TBH to me that sounds like programmers using Vim claiming they're more productive than users of other editors because they use fewer keystrokes.

    • > Talk to the LLM like a human

      Maybe the LLM doesn't strictly need it, but typing out does bring some clarity for the asker. I've found it helps a lot to catch myself - what am I even wanting from this?

    • I'm not sure about your example about talking to LLMs. There is good reason to think that speaking to it like a human might produce better results, as that's what most of the training data is composed of.

      I don't have any studies, but it seems to me reasonable to assume.

      (Unlike google, where presumably it actually used keywords anyway)

      1 reply →

  • Thank you for the last paragraph.

    Same thought came when I was reading the article and glad I am not alone.

    Anecdotally, the most common productivity boost comes from cutting out weird slow steps in processes: writing an automation script, a campaign previewer for marketing, etc.

    Coding seems to become more efficient (again anecdotally) but not entirely faster: you can do better work on a new feature in the same or slightly less time.

    Idle time at 4% was interesting. I think this number goes higher the more you use a specific tool and adjust your workflow to it.

  • "My intiution is that..." - AGREED.

    I've found that there are a couple of things you need to do to be very efficient.

    - Maintain an architecture.md file (with AI assistance) that answers many of the questions and clarifies a lot of the ambiguity in the design and structure of the code.

    - A bootstrap.md file(s) is also useful for a lot of tasks.. having the AI read it and start with a correct idea about the subject is useful and a time saver for a variety of kinds of tasks.

    - Regularly asking the AI to refactor code, simplify it, modularize it - this is what the experienced dev is for. VIBE coding generally doesn't work as AI's tend to write messy non-modular code unless you tell them otherwise. But if you review code, ask for specific changes.. they happily comply.

    - Read the code produced, and carefully review it. And notice and address areas where there are issues, have the AI fix all of these.

    - Take over when there are editing tasks you can do more efficiently.

    - Structure the solution/architecture in ways that you know the AI will work well with.. things it knows about.. its general sweet spots.

    - Know when to stop using the AI and code it yourself, particularly when the AI has entered the confusion doom loop. Time wasted trying to get the AI to figure out something it never will is better spent just fixing it yourself.

    - Know when to just not ever try to use AI. Intuitively you know there's just certain code you can't trust the AI to safely work on. Don't be a fool and break your software.

    ----

    I've found there's no guarantee that AI assistance will speed up any one project (and in some cases it slows it down).. but measured across all tasks and projects, the benefits are pretty substantial. That's probably others' experience at this point too.

  • In addition to the learning curve of the tooling, there's also the learning curve of the models. Each have a certain personality that you have to figure out so that you can catch the failure patterns right away.

  • > A quarter of the participants saw increased performance, 3/4 saw reduced performance.

    The study used 246 tasks across 16 developers, for an average of 15 tasks per developer. Divide that further in half because tasks were assigned as AI or not-AI assisted, and the sample size per developer is still relatively small. Someone would have to take the time to review the statistics, but I don’t think this is a case where you can start inferring that the developers who benefited from AI were just better at using AI tools than those who were not.
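
    To put a rough number on that intuition, here is an illustrative simulation (all parameters invented, nothing taken from the paper's data): even if AI had exactly zero true effect for every developer, with only ~8 AI and ~8 non-AI tasks per person and noisy per-task times, individual developers will routinely look faster or slower with AI purely by chance.

    ```python
    import random

    random.seed(0)

    def apparent_speed_ratio(n_tasks=8, noise_sigma=0.4):
        """Mean AI-task time over mean non-AI-task time for one simulated developer,
        assuming NO true AI effect and lognormal noise on each task (parameters invented)."""
        ai_times    = [random.lognormvariate(0, noise_sigma) for _ in range(n_tasks)]
        no_ai_times = [random.lognormvariate(0, noise_sigma) for _ in range(n_tasks)]
        return (sum(ai_times) / n_tasks) / (sum(no_ai_times) / n_tasks)

    ratios = [apparent_speed_ratio() for _ in range(16)]       # 16 simulated developers
    looks_faster = sum(r < 1 for r in ratios)
    print(f"{looks_faster}/16 simulated developers look faster with AI despite zero true effect")
    print(f"apparent per-developer effects range from {min(ratios):.2f}x to {max(ratios):.2f}x")
    ```

    None of this says anything about the study's aggregate result; it only illustrates why a 4-vs-12 split at the individual level is weak evidence about which developers are "better at AI".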

    I do agree that it would be interesting to repeat a similar test on developers who have more AI tool assistance, but then there is a potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools.

    • > potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools

      I don't think this is a confounding effect

      This is something that we definitely need to measure and be aware of, if there is a risk of it

  • I can say that in my experience AI is very good at early codebases and refactoring tasks that come with that.

    But for very large stable codebases it is a mixed bag of results. Their selection of candidates is valid but it probably illustrates a worst case scenario for time based measurement.

    If an AI code editor cannot make changes quicker than a dev, or cannot provide relevant suggestions quickly enough and without being distracting, then you lose time.

  • I have been teaching people at my company how to use AI code tools. The learning curve is way worse for developers, and I have had to come up with some exercises to try to break through it. Some seemingly can't get it.

    The short version is that devs want to give instructions instead of ask for what outcome they want. When it doesn’t follow the instructions, they double down by being more precise, the worst thing you can do. When non devs don’t get what they want, they add more detail to the description of the desired outcome.

    Once you get past the control problem, then you have a second set of issues for devs where the things that should be easy or hard don’t necessarily map to their mental model of what is easy or hard, so they get frustrated with the LLM when it can’t do something “easy.”

    Lastly, devs keep a shit load of context in their head - the project, what they are working on, application state, etc. - and they need to do that for LLMs too, but they have to repeat themselves often and “be” the external memory for the LLM. Most devs I have taught hate that; they actually would rather have it the other way around, where they get help with context and state but instruct the computer on their own.

    Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.

    • > Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.

      The CTO and VPEng at my company (very small, still do technical work occasionally) both love the agent stuff so much. Part of it for them is that it gives them the opportunity to do technical work again with the limited time they have. Without having to distract an actual dev, or spend a long time reading through the codebase, they can quickly get context for an build small items themselves.

    • > Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding

      This suggests to me, though, that they are bad at coding -- otherwise they would have stayed longer. And I can't find anything in your comment that would corroborate the opposite. So what gives?

      I am not saying what you say is untrue, but you didn't give any convincing arguments to us to believe otherwise.

      Also, you didn't define the criteria of getting better. Getting better in terms of what exactly???

      5 replies →

  • > My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

    Yes, and I'll add that there is likely no single "golden workflow" that works for everybody, and everybody needs to figure it out for themselves. It took me months to figure out how to be effective with these tools, and I doubt my approach will transfer over to others' situations.

    For instance, I'm working solo on smallish, research-y projects and I had the freedom to structure my code and workflows in a way that works best for me and the AI. Briefly: I follow an ad-hoc, pair-programming paradigm, fluidly switching between manual coding and AI-codegen depending on an instinctive evaluation of whether a prompt would be faster. This rapid manual-vs-prompt assessment is second nature to me now, but it took me a while to build that muscle.

    I've not worked with coding agents, but I doubt this approach will transfer over well to them.

    I've said it before, but this is technology that behaves like people, and so you have to approach it like working with a colleague, with all their quirks and fallibilities and potentially-unbound capabilities, rather than a deterministic, single-purpose tool.

    I'd love to see a follow-up of the study where they let the same developers get more familiar with AI-assisted coding for a few months and repeat the experiment.

    • > I've not worked with coding agents, but I doubt this approach will transfer over well to them.

      Actually, it works well so long as you tell them when you’ve made a change. Claude gets confused if things randomly change underneath it, but it has no trouble so long as you give it a short explanation.

  • Devil's advocate: it's also possible the one developer hasn't become more productive with Cursor, but rather has atrophied their non-AI productivity due to becoming reliant on Cursor.

    • I suspect you're onto something here but I also think it would be an extremely dramatic atrophy to have occurred in such a short period of time...

  • >My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

    Are we still selling the "you are an expert senior developer" meme? I can completely see how, once you are working on a mature codebase, LLMs would only slow you down. Especially one that was not created by an LLM and where you are the expert.

    • I think it depends on the kind of work you're doing, but I use it on mature codebases where I am the expert, and I heavily delegate to Claude Code. By being knowledgeable of the codebase, I know exactly how to specify a task I need performed. I set it to work on one task, then I monitor it while personally starting on other work.

      I think LLMs shine when you need to write a higher volume of code that extends a proven pattern, quickly explore experiments that require a lot of boilerplate, or have multiple smaller tasks that you can set multiple agents upon to parallelize. I've also had success in using LLMs to do a lot of external documentation research in order to integrate findings into code.

      If you are fine-tuning an algorithm or doing domain-expert-level tweaks that require a lot of contextual input-output expert analysis, then you're probably better off just coding on your own.

      Context engineering has been mentioned a lot lately, but it's not a meme. It's the real trick to successful LLM agent usage. Good context documentation, guides, and well-defined processes (just like with a human intern) will mean the difference between success and failure.

  • I feel like I get better at it as I use Claude code more because I both understand its strength and weaknesses and also understand what context it’s usually missing. Like today I was struggling to debug an issue and realised that Claude’s idea of a coordinate system was 90 degrees rotated from mine and thus it was getting confused because I was confusing it.

  • How were "experienced engineers" defined?

    I've found AI to be quite helpful in pointing me in the right direction when navigating an entirely new code-base.

    When it's code I already know like the back of my hand, it's not super helpful, other than maybe doing a few automated tasks like refactoring, where there have already been some good tools for a while.

    • > To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years.

  • Any "tricks" you learn for one model may not be applicable to another, it isn't a given that previous experience with a company's product will increase the likelihood of productivity increases. When models change out from under you, the heuristics you've built up might be useless.

  • It seems really surprising to me that anyone would call 50 hours of experience a "high skill ceiling".

  • I just treat AI as a very long autocomplete. Sometimes it surprises me. On things I do not know, like Windows C calls, I think I ought to just search the documentation.

  • What you described has been true of the adoption of every technology ever

    Nothing new this time except for people who have no vision and no ability to work hard not “getting it” because they don’t have the cognitive capacity to learn

  • LLMs are good for things you know how to do, but can't be arsed to. Like small tools with extensive use of random APIs etc.

    For example, I whipped together a Steam API-based tool that gets my game library and enriches it with available data, in maybe 30 minutes of active work (a rough sketch of the core call is at the end of this comment).

    The LLM (Cursor with Gemini Pro + Claude 3.7 at the time IIRC) spent maybe 2-3 hours on it while I watched some shows on my main display and it worked on my second screen with me directing it.

    Could I have done it myself from scratch like a proper artisan? Most definitely. Would I have bothered? Nope.
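
    The core of a tool like that is essentially one HTTP call plus some formatting. A rough Python sketch of what is meant (assuming the standard Steam Web API `GetOwnedGames` endpoint; the key and ID are placeholders, and the commenter's actual tool did more enrichment than this):

    ```python
    import requests

    STEAM_API_KEY = "YOUR_KEY"        # placeholder
    STEAM_ID = "7656119XXXXXXXXXX"    # placeholder 64-bit SteamID

    def get_owned_games():
        """Fetch the owned-games list from the Steam Web API."""
        resp = requests.get(
            "https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/",
            params={
                "key": STEAM_API_KEY,
                "steamid": STEAM_ID,
                "include_appinfo": 1,   # include game names, not just app IDs
                "format": "json",
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["response"].get("games", [])

    # Ten most-played games; playtime_forever is reported in minutes.
    for game in sorted(get_owned_games(), key=lambda g: -g.get("playtime_forever", 0))[:10]:
        print(f'{game["name"]}: {game.get("playtime_forever", 0) // 60} h')
    ```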

  • Simon's opinion is unsurprisingly that people need to read his blog and spam on every story on HN lest we be left behind.

  • A friend of mine, complete non-programmer, has been trying to use ChatGPT to write a phone app. I've been as hands off as I feel I can be, watching how the process goes for him. My observations so far is that it's not going well, he doesn't understand what questions he should be asking so the answers he's getting aren't useful. I encourage him to ask it to teach him the relevant programming but he asks it to help him make the app without programming at all.

    With more coaching from me, which I might end up doing, I think he would get further. But I expected the chatbot to get him further through the process than this. My conclusion so far is that this technology won't meaningfully shift the balance of programmers to non-programmers in the general population.

  • > My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

    You hit the nail on the head here.

    I feel like I’ve seen a lot of people trying to make strong arguments that AI coding assistants aren’t useful. As someone who uses and enjoys AI coding assistants, I don’t find this research angle to be… uh… very grounded in reality?

    Like, if you’re using these things, the fact that they are useful is pretty irrefutable. If one thinks there’s some sort of “productivity mirage” going on here, well OK, but to demonstrate that it might be better to start by acknowledging areas where they are useful, and show that your method explains the reality we’re seeing before using that method to show areas where we might be fooling ourselves.

    I can maybe buy that AI might not be useful for certain kinds of tasks or contexts. But I keep pushing their boundaries and they keep surprising me with how capable they are, so it feels like it’ll be difficult to prove otherwise in a durable fashion.

    • I think the thing is there IS a learning curve, AND there is a productivity mirage, AND they are immensely useful, AND it is context dependent. All of this leads to a lot of confusion when communicating with people who are having a different experience.

      2 replies →

    • Exactly. The people who say that these assistants are useless or "not good enough" are basically burying their heads in the sand. The people who claim that there is no mirage are burying their head in the sand as well...

It is 80/20 again - it gets you 80% of the way in 20% of the time and then you spend 80% of the time to get the rest of the 20% done. And since it always feels like it is almost there, sunk-cost fallacy comes into play as well and you just don't want to give up.

I think an approach that I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.

  • Well, we used to have a sort of inverse Pareto, where 80% of the work took 80% of the effort and the remaining 20% of the work also took 80% of the effort.

    I do think you're onto something with getting pebbles out of the road, inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API and I kept running into ConcurrentModificationExceptions, which happen when the list is structurally modified while it's being worked on -- in my case, multiple threads mutating the same list, so no thread can guarantee it has the latest copy unaltered by the others. I spent about an hour trying to write a method that deep copies the list, makes the change and then returns the copy, running into all sorts of problems, until I asked AI to build me a thread-safe list mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just... does this." Cases like this are where AI is supremely useful - intricate but well-defined problems.

    • > once I know what I need to do AI coding makes the doing much faster

      Most commenters on this paper seem to not respond to the strongest result from it. That is, the developers wrongly thought and felt that using AI had sped up their work. So we need to be super cautious about what we think we know.

    • Code reuse at scale: 80 + 80 = 160% ~ phi...coincidence?

      I think this may become a long horizon harvest for the rigorous OOP strategy, may Bill Joy be disproved.

      Gray goo may not [taste] like steel-cut oatmeal.

      2 replies →

  • I think it’s most useful when you basically need Stack Overflow on steroids: I basically know what I want to do but I’m not sure how to achieve it using this environment. It can also be helpful for debugging and rubber ducking generally.

    • All those things are true, but it's such a small part of my workflow at this point that the savings, while nice, aren't nearly as life-changing to my job as my CEO is forcing us to think they are.

      Until AI can actually untangle our 14-year-old codebase full of hodge-podge code and read every commit message, JIRA ticket, and Slack conversation related to the changes in full context, it's not going to solve a lot of the hard problems at my job.

      1 reply →

    • > rubber ducking

      I don't mean to pick on your usage of this specifically, but I think it's noteworthy that the colloquial definition of "rubber ducking" seems to have expanded to include "using a software tool to generate advice/confirm hunches". I always understood the term to mean a personal process of talking through a problem out loud in order to methodically, explicitly understand a theoretical plan/process and expose gaps.

      Based on a lot of articles/studies I've seen (admittedly I haven't dug into them too deeply), it seems like the use of chatbots to perform this type of task actually has negative cognitive impacts on some groups of users - the opposite of the personal value I thought rubber-ducking was supposed to provide.

      3 replies →

    • The issue is that it is slow and verbose, at least in its default configuration. The amount of reading is non trivial. There’s a reason most references are dense.

      8 replies →

    • Absolutely this. For a while I was working with a language I was only partially familiar with, and I'd say "here's how I would do this in [primary language], rewrite it in [new language]" and I'd get a decent piece of code back. A little searching in the project to make sure it was stylistically correct and then done.

      1 reply →

  • > and then you spend 80% of the time to get the rest of the 20% done

    This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.

    Related: One of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.

    • It’s not funny when you find yourself redoing the first 80%, as the only way to complete the second 80%.

    • Let us know if that dev you're talking about winds up working 90% less for the same amount, or earning 1000x more

      Otherwise he can shut the fuck up about being 1000x more valuable imo

  • 100% agreed. It is all about removing friction for me. Case in point: I would not have touched React in my previous career without the assist that LLMs now provide. The barrier to entry just _felt_ to be too large and one always has the instinct to stick with what one knows.

    However, it is _fun_ to go over the barrier if it is chatting with a model to get a quick tutorial and produce working code for a prototype (for your specific needs) where the understanding that you just developed is applied. The alternative (without LLMs) is to first do the ground work of learning via tutorials in text/video form and then do the cognitive mapping of applying the learning to one's prototype. I would make a lot of mistakes that expert/intermediate React developers don't make on this path.

    One could argue that it shortcuts some learning and perhaps the old way results in better retention. But, our field changes so fast... and when it remains static for too long, projects die. I think of all this as accelerant for progress in adoption of new ways of thinking about software and diffusing that more quickly across the developer population globally. Code is always fungible, anyway. The job is about all the other things that one needs to do besides coding.

  • Agreed and +1 on "always feels like it is almost there" leading to time sink. AI is especially good at making you feel like it's doing something useful; it takes a lot of skill to discern the truth.

  • It works great on adding stuff to an already established codebase. Things like “we have these search parameters, also add foo”. Remove anything related to x…

    • Exactly. If you can give it a contract and a context, essentially, and it doesn't need to write a large amount of code to fulfill it, it can be great.

      I just used it to write about 80 lines of new code like that, and there's no question it saves time.

  • As an old dev this is really all I want: a sort of autocorrect for my syntactical errors to save me a couple compile-edit cycles.

    • What I want is not autocorrect, because that won't teach me anything. I want it to yell at me loudly and point to the syntactical error.

      Autocorrect is a scourge of humanity.

  • The problem is I then have to also figure out the code it wrote to be able to complete the final 20%. I have no momentum and am starting from almost scratch mentally.

  • This is just not true in my experience. Not with the latest models. I routinely manage to 1-shot a whole "thing." E.g., yesterday I needed a WordPress plugin for one-time use to clean up a friend's site. I described exactly what I needed, it produced the code, it ran perfectly the first time, and the UI looked like a million dollars. It got me 100% of the way in 0% of the time.

    I'm the biggest skeptic, but more and more I'm seeing it get me the bulk of the way with very little back-and-forth. If it was even more heavily integrated in my dev environment, it would save me even more time.

Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!

If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.

[1] https://x.com/METR_Evals/status/1943360399220388093

  • Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.

  • (I read the post but not paper.)

    Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.

  • Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?

    If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.

    If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.

    • The instructions given to developers were not just "implement with AI" - but rather that they could use AI if they deemed it would be helpful, but indeed did _not need to use AI if they didn't think it would be helpful_. In about 16% of labeled screen recordings where developers were allowed to use AI, they chose to use no AI at all!

      That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]

      [1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

  • Could you either release the dataset (raw but anonymized) for independent statistical evaluation or at least add the absolute times of each dev per task to the paper? I'm curious what the absolute times of each dev with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typer getting a big boost out of LLMs.

    Also, cool work, very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.

    • Yep, sorry, meant to post this somewhere but forgot in final-paper-polishing-sprint yesterday!

      We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).

      Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.

      2 replies →

  • Does this reproduce for early/mid-career engineers who aren't at the top of their game?

    • How these results transfer to other settings is an excellent question. Previous literature would suggest speedup -- but I'd be excited to run a very similar methodology in those settings. It's already challenging as models + tools have changed!

Wow, these are extremely interesting results, especially this part:

> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

I wonder what could explain such a large difference between estimation/experience and reality. Any ideas?

Maybe our brains are measuring mental effort and distorting our experience of time?

  • Here's a scary thought, which I'm admittedly basing on absolutely nothing scientific:

    What if agentic coding sessions are triggering a similar dopamine feedback loop as social media apps? Obviously not to the same degree as social media apps, I mean coding for work is still "work"... but there's maybe some similarity in getting iterative solutions from the agent, triggering something in your brain each time, yes?

    If that was the case, wouldn't we expect developers to have an overly positive perception of AI because they're literally becoming addicted to it?

    • Like the feeling of the command line being always faster than using the GUI? Different ways we engage with a task can change our time perception.

      I wish there was a simple way to measure energy spent instead of time. Maybe nature is just optimizing for something else.

    • What if agentic coding results in _less_ dopamine than manual coding? Because honestly I think that's more likely and jibes with my experience.

      There's no flow state to be achieved with AI tools (at the moment)

      1 reply →

    • This is fascinating and would go a long way toward explaining why people seem to have totally different experiences with the same machines.

    • That's my suspicion too.

      My issue with this being a 'negative' thing is that I'm not sure it is. It works off the same hunting/foraging instincts that keep us alive. If you feel addiction to something positive, is it bad?

      Social media is negative because it addicts you to mostly low quality filler content. Content that doesn't challenge you. You are reading shit posts instead of reading a book or doing something better for you in the long run.

      One could argue that's true for AI, but I'm not confident enough to make such a statement.

      1 reply →

  • I would speculate that it's because there's been a huge concerted effort to make people want to believe that these tools are better than they are.

    The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.

    This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, even for an experienced engineer, it moves the "baseline" expectation much higher.

    Unfortunately this is very difficult to capture empirically.

  • I also wonder how many of the numerous AI proponents in HN comments are subject to the same effect. Unless they are truly measuring their own performance, is AI really making them more productive?

    • How would you even measure your own performance? You can go and redo something, forgetting everything you did along the way the first time

      4 replies →

  • > I wonder what could explain such large difference between estimation/experience vs reality, any ideas?

    This bit I wasn't at all surprised by, because this is _very common_. People who are doing a [magic thing] which they believe in often claim that it is improving things even where it empirically isn't; very, very common with fad diets and exercise regimens, say. You really can't trust subjects' claims of efficacy of something that's being tested on them, or that they're testing on themselves.

    And particularly for LLM tools, there is this strong sense amongst many fans that they are The Future, that anyone who doesn't get onboard is being Left Behind, and so forth. I'd assume a lot of users aren't thinking particularly rationally about them.

  • I think just about every developer hack turns out this way: static vs dynamic types; keyboard shortcuts vs mice; etc. But I think it’s also possible to over-interpret these findings: using the tools that make your work enjoyable has important second-order effects even if they aren’t the productivity silver bullet everyone claims they are.

  • It’s funny cause I sometimes have the opposite experience. I tried to use Claude code today to make a demo app to show off a small library I’m working on. I needed it to set up some very boilerplatey example app stuff.

    It was fun to watch, it’s super polished and sci-fi-esque. But after 15 minutes I felt braindead and was bored out of my mind lol

  • Part of it is that I feel I don't have to put as much mental energy into the coding part. I use my mental energy on the design and ideas, then kinda breeze through the coding now with AI at a much lower mental energy state than I would have when I was typing every single character of every line.

> developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with. The other is that it's tempting to time an AI with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code was initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.

  • Qualitatively, we don't see a drop in PR quality between AI-allowed and AI-disallowed conditions in the study; the devs who participate are generally excellent, know their repositories' standards super well, and aren't really into the 'put up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.

    Developers do spend their time totally differently, though; this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not - in general, when these devs have AI, they spend a smaller % of time writing code, and a larger % of time working with AI (which... makes sense).

    [1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

  • It's really hard to attribute productivity gains/losses to specific technologies or practices, I'm very wary of self-reported anecdotes in any direction precisely because it's so easy to fool ourselves.

    I'm not making any claim in either direction, the authors themselves recognize the study's limitations, I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime, making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.

  • > I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with.

    The standard experimental design that solves this is to randomly assign participants to the experiment group (with AI) and the control group (without AI), which is what they did. This isolates the variable (with or without AI), taking into account uncontrollable individual, context, and environmental differences. You don't need to know how the single individual and context would have behaved in the other group. With a large enough sample size and effect size, you can determine statistical significance, and that the with-or-without-AI variable was the only difference.
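
    As a minimal sketch of the idea (illustrative only, in TypeScript; the names are made up and this is not the study's actual analysis pipeline): randomize the assignment per issue, then compare group means once completion times are in.

        // Randomly assign each issue to the AI-allowed or AI-disallowed condition.
        interface Issue { id: string; hours: number; withAI: boolean }

        function randomize(ids: string[]): Issue[] {
          return ids.map(id => ({ id, hours: 0, withAI: Math.random() < 0.5 }));
        }

        // Once completion times are recorded, compare the mean time per condition.
        function meanHours(issues: Issue[], withAI: boolean): number {
          const group = issues.filter(i => i.withAI === withAI);
          return group.reduce((sum, i) => sum + i.hours, 0) / group.length;
        }

        // Estimated effect: ratio of the no-AI group mean to the AI group mean.
        // A real analysis would add significance testing on top of this.
        function speedupRatio(issues: Issue[]): number {
          return meanHours(issues, false) / meanHours(issues, true);
        }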

  • Figure 21 shows that initial implementation time (which I take to be time to PR) increased as well, although post-review time increased even more (but doesn't seem to have a significant impact on the total).

    But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.

    So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.

One thing I've experienced in trying to use LLMs to code in an existing large code base is that it's _extremely_ hard to accurately describe what you want to do. Oftentimes, you are working on a problem with a web of interactions all over the code and describing the problem to an LLM will take far longer than just doing it manually. This is not the case with generating new (boilerplate) code for projects, which is where users report the most favorable interaction with LLMs.

  • That’s my experience as well. It’s where Knuth comes in again: the program doesn’t just live in the code, but also in the minds of its creator. Unless I communicate all that context from the start, I can’t just dump years of concepts and strategy out of my brain into the LLM without missing details that would be relevant.

  • Hell, a lot of times I can't even explain an idea to my coworkers in a conversation, and I eventually say "I'll just explain it in code instead of words." And I just quickly put up a small PR that makes the skeleton of the changes (or even the entire changeset) and then we continue our conversation (or just do the review).

So far in my own hobby OSS projects, AI has only hampered things as code generation/scaffolding is probably the least of my concerns, whereas code review, community wrangling, etc. are more impactful. And AI tooling can only do so much.

But it's hampered me in the fact that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30 line PR.

Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)

As an open source maintainer on the brink of tech debt bankruptcy, I feel like AI is a savior, helping me keep up with rapid changes to dependencies, build systems, release methodology, and idioms.

IME AI coding is excellent for one-off scripts, personal automation tooling (I iterate on a tool to scrape receipts and submit expenses for my specific needs) and generally stuff that can be run in environments where the creator and the end user are effectively the same (and only) entity.

Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).

Using it for anything more than boilerplate code, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API in production code, before a human takes a full pass at everything is something I'm going to be wary of for a long time.

This study neglects to incorporate the fact that I have forgotten how to write code.

  • Honestly, this is a fair point -- and speaks to the difficulty of figuring out the right baseline to measure against here!

    If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.

    In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!

    • >If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.

      Wouldn't this be an underestimate, since without ai there'd be no forward progress at all? So ai-assisted is infinite speedup if the outputs are good.

  • I'm curious what space people are working in where AI does their job entirely.

    I can use it for parts of code, algorithms, error solving, and maybe sometimes a 'first draft'.

    But there is no way I could finish an entire piece of software with AI only.

So they paid developers 300 x 246 = about 73K just for developer recruitment for a study that isn't in any academic journal and has no peer review? The underlying paper looks quite polished and not overtly AI-generated, so I don't want to say it's entirely made up, but how were they even able to get funding for this?

  • Our largest funding was through The Audacious Project -- you can see an announcement here: https://metr.org/blog/2024-10-09-new-support-through-the-aud...

    Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate

  • >which is not in any academic journal, or has no peer reviews?

    As a philosopher who is into epistemology and ontology, I find this to be as abhorrent as religion.

    With 'science', it doesn't matter who publishes it. Science needs to be replicated.

    The psychology replication crisis is a prime example of why peer reviews and publishing in a journal matters 0.

    • > The psychology replication crisis is a prime example of why peer reviews and publishing in a journal matters 0.

      Specifically, it works as an example of a specific case where peer review doesn’t help as much. Peer review checks your arguments, not your data collection process (which the reviewer can’t audit for obvious reasons). It works fine in other scenarios.

      Peer review is unrelated to replication problems, except to the extent to which confused people expect peer review to fix totally unrelated replication problems.

    • Peer reviews are very important to filter out obviously low effort stuff.

      ...Or should I say "were" very important? With the help of today's GenAI every low effort stuff can look high effort without much extra effort.

  • Companies produce whitepapers all the time, right? They are typically some combination of technical report, policy suggestion, and advertisement for the organization.

  • Most of the world provides funding for research, the US used to provide funding but now that has been mostly gutted.

So: slower until the learning curve is climbed (or, as one user posited, "until you forget how to work without it").

But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?

Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of

- bringing shared business logic up into shared folders

- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure

- working to separate business logic from API logic from display logic

- working to provide encapsulation through the use of wrapper functions creating portability

- using techniques like dependency injection to decouple concepts, allowing for easier testing (see the sketch just after this list)

etc
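
For the dependency injection point, a minimal sketch of the idea (the names here are purely illustrative, not from any real codebase): the business logic depends on an interface rather than a concrete client, so tests can inject a fake.

    // Business logic depends on an interface, not a concrete implementation.
    interface PaymentGateway {
      charge(customerId: string, cents: number): Promise<boolean>;
    }

    class InvoiceService {
      constructor(private readonly gateway: PaymentGateway) {}

      async settle(customerId: string, cents: number): Promise<string> {
        const ok = await this.gateway.charge(customerId, cents);
        return ok ? "settled" : "failed";
      }
    }

    // In tests, inject a stub instead of the real HTTP-backed gateway.
    const fakeGateway: PaymentGateway = { charge: async () => true };
    const service = new InvoiceService(fakeGateway);
    service.settle("c-123", 500).then(console.log);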

So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?

For me, the measurable gain in productivity comes when I am working with a new language or new technology. If I were to use Claude Code to help implement a feature of a Python library I've worked on for years, then I don't think it would help much (maybe even hurt). However, if I use Claude Code on some Go code I have very little experience with, or use it to write/modify Helm charts, then I can definitely say it speeds me up.

But, taking a broader view, it's possible that these initial speedups are negated by the fact that I never really learn Go or Helm charts as deeply now that I use Claude Code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even attempted these more difficult Go library modifications if I didn't have Claude Code to hold my hand.

Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.

  • agreed - it helps transpose skills.

    That said, this comes up often in my office. It's just not giving really good advice in many situations - especially novel ones.

    AI is super good at coming up with things that have been written ad nauseam for coding-interview-prep websites.

This study focused on experienced OSS maintainers. Here is my personal experience, but from a very different persona (roughly the opposite of the one in the study). I always wanted to contribute to OSS but never had time to; I was finally able to do that, thanks to AI. Last month, I contributed to 4 different repositories, which I would never have dreamed of doing. I was using an async coding agent I built [1] to generate PRs given a GitHub issue. Some PRs took a lot of back and forth, and some PRs were accepted as is. Without AI, there is no way I would have contributed to those repositories.

One thing that did work in my favor is that I was creating a clear failing repro test case and including the before and after along with the PR. That helped get the PRs landed.

There are also a few PRs that never got accepted because the repro is not as strong or clear.

[1] https://workback.ai

  • Did you make the contributions though? Or did the LLM?

    This is not directed at you, but I am worried that contributors that use AI "exclusively" to contribute to OSS projects are extracting the value (street cred, being seen as part of the project community) without actually contributing anything (by being one more person that knows the codebase and can help steward it).

    It's the same thing we've seen out of enshittification of everything. Value extraction without giving back.

    Maybe I'm too much of a cynic. Maybe majority of OSS projects don't care. But I know I will be saddened if one of the OSS projects I care about get taken over by such "value extractors".

    • Did not take it personally. You brought up a good point.

      I have a slightly different perspective. Imo, using OSS without contributing is value extraction without giving back.

      If someone can fix a bunch of chores (that still take human time), with the use of AI (even though they don't become stewards), I still see it as giving back. Of course, there is a value chain - contributing with AI without understanding code is the bottom of value creation. Like you mentioned, also being a steward is the top of the value chain. Along the way, if the contributor builds some sorta reputation that would help with their career or other outcomes, so be it.

      So in that sense, I don't see it as enshittification. AI might make a pathway to resolve a bunch of things which otherwise wouldn't be resolved. In fact, this was the line of thinking for the tool we built. Instead of people making these mindless PRs, can we build an agent that can take care of 'trivial' tasks? I manually created PRs to test that hypothesis.

      There is also a natural self selection here. If someone was able to fix something without understanding any code, that is also indicative of how trivial the task is. There is a reverse effect to my argument though. These "AI contributors" can create a ton of PRs that would create a lot of work for maintainers to review them.

      In my case, I was upfront about how I was raising PRs and asked permission before working on a given issue. Maintainers are quite open and inviting.

      1 reply →

I've been using LLMs almost every day for the past year. They're definitely helpful for small tasks, but in real, complex projects, reviewing and fixing their output can sometimes take more time than just writing the code myself.

We probably need a bit less wishful thinking. Blindly trusting what the AI suggests tends to backfire. The real challenge is figuring out where it actually helps, and where it quietly gets in the way.

I actually think that pasting questions into chatGPT etc. and then getting general answers to put into your code is the way.

“One-shotting” apps, or even Cursor and so forth, seems like a waste of time. It feels like if you prompt it just right it might help, but then it never really does.

  • I've done okay with copilot as a very smart autocomplete on: a) a very typical codebase, with b) lots of boilerplate, where c) I'm not terribly familiar with the languages and frameworks, which are d) very, very popular but e) I don't really like, so I'm not particularly motivated to become familiar with them. I'm not a frontend developer, I don't like it, but I'm in a position now where I need to do frontend things with a verbose Typescript/React application which is not interesting from a technical point of view (it's a good product, just not because it has an interesting or demanding front end). Copilot (I use Emacs, so cursor is a non-starter, but copilot-mode works very well for Typescript) has been pretty invaluable for just sort of slogging through stuff.

    For everything else, I think you're right, and actually the dialog-oriented method is way better. If I learn an approach and apply some general example from ChatGPT, but I do the typing and implementation myself so I need to understand what I'm doing, I'm actually leveling up and I know what I'm finished with. If I weren't "experienced", I'd worry about what it was doing to my critical thinking skills, but I know enough about learning on my own at this point to know I'm doing something.

    I'm not interested in vibe coding at all--it seems like a one-way process to automate what was already not the hard part of software engineering; generating tutorial-level initial implementations. Just more scaffolding that eventually needs to be cleared away.

I guess I am an experienced open-source developer

(https://github.com/albumentations-team/Albumentations)

15k stars, 5 million monthly downloads

----

It may happen that Cursor in agentic mode writes code more slowly than I do. But!

It frees me from being in the IDE 100% of the time.

There is an infinite list of educational videos, blog posts, scientific papers, Hacker News, Twitter, and Reddit that I want to read, and going through them while agents do their job is ultra convenient.

=> If I think about "productivity" in a broader way => with Cursor + agents, my overall productivity moved to a whole other level.

It used to be that all you needed to program was a computer and to RTFM, but now we need to pay for API "tokens" and pray that there are no rug pulls in the future.

  • "It used to be that all you required to write was a pen and paper but now we need to pay for 'electricity'..."

    You can still do those things.

This does not mention the open-source developer time wasted while reviewing vibe coded PRs

  • Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.

    There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!

    [1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

Very interesting methodology, but the sample size (16) is way too low. Would love to see this repeated with more participants.

  • Noting that most of our power comes from the number of tasks that developers complete; it's 246 total completed issues in the course of this study -- developers do about 15 issues (7.5 with AI and 7.5 without AI) on average.

    • Did you compare the variance within individuals (due to treatment) to the variance between individuals (due to other stuff)?

Something I don’t see mentioned that’s been helpful to me is having an agent add strict type safety to my TypeScript. I avoid the “any” type, and berating an agent to “make it work” really opens my eyes and forces me to learn how advanced TypeScript can be leveraged. I feel that creating complex types that make my code simpler and make autocomplete work(!) is a great tradeoff in some meta dimension of software dev.
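
For illustration, here's a small sketch of the kind of type-level work I mean (simplified, not my actual code): a discriminated union built from a mapped type replaces “any”, so the compiler narrows each branch and autocomplete actually works.

    // Sketch: replacing `any` with a discriminated union plus a mapped type.
    type Events = {
      click: { x: number; y: number };
      keypress: { key: string };
    };

    // A union of { kind; payload } pairs, derived from the map above.
    type EventMessage = {
      [K in keyof Events]: { kind: K; payload: Events[K] };
    }[keyof Events];

    // The compiler narrows `event.payload` in each branch, so the editor
    // can autocomplete `x`, `y`, or `key` correctly.
    function handle(event: EventMessage): string {
      switch (event.kind) {
        case "click":
          return `clicked at ${event.payload.x},${event.payload.y}`;
        case "keypress":
          return `pressed ${event.payload.key}`;
      }
    }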

As a project for work, I've been using Claude CLI all week to do as many tasks as possible. So with my week's experience, I'm now an expert in this subject and can weigh in.

Two things stand out to me: 1. It depends a lot on what kind of task you are having the LLM do. 2. Even if the LLM process takes more time, your cognitive effort was very likely still way less.

For sysadmin kinds of tasks, working with less often accessed systems, LLMs can read --help, man pages, and doc sites all for you, and give you the working command right there (and then run it, then look at the output and tell you why it failed, or how it worked, and what it did). There is absolutely no question that second part is a big deal.

Sticking it onto my large open source project to fix a deep, esoteric issue or write some subtle documentation where it doesn't really "get" what I'm doing, yeah, it is not as productive in that realm and you might want to skip it for the thinking part there. I think everyone is trying to figure out this question of "when and how" for LLMs. I think the sweet spot is tasks involving systems and technologies where you'd otherwise be spending a lot of time googling, stackoverflowing, and reading man pages to get just the right parameters into commands and so forth. This is cognitive grunt work, and LLMs can do that part very well.

My week of effort with it was not really "coding on my open source project"; two examples:

1. Running a bunch of Ansible playbooks that I wrote years ago on a new host, where OS upgrades had lots of snags. I worked with Claude to debug all the various error messages and places where the newer OS distribution had different packages, missing packages, etc. It was ENORMOUSLY helpful, since I never look at these playbooks and don't even remember what I did; Claude can read them for you and interpret them as well as you can.

2. I got a Bugzilla for a Fedora package that I packaged years ago, where they have some change to the directives used in specfiles that everyone has to make. I look at Fedora packaging workflows once every three years. I told Claude to read the BZ and just do it. IT DID IT. I had to get involved running the "mock" suite as it needed sudo, but Claude gave me the commands. Zero googling. Zero even reading the new format of the specfile (the BZ linked to a tool that does the conversion). From bug received to bug closed, I didn't do any typing at all outside of the prompt. Had it done before breakfast, since I didn't even need any glucose for mental energy expended. This would have been a painful and frustrating mental effort otherwise.

so the studies have to get more nuanced and survey a lot more than 16 devs I think

  • > esoteric issue or write some subtle documentation where it doesnt really "get" what I'm doing, yeah it is not as productive in that realm and you might want to skip it for the thinking part there

    I've been sensing in these cases that I don't have a good enough way to express these problems, and that I actually need to figure that out, or the rest of my team, whether they're using AI or not, are gonna have a real hard time understanding the change I made.

It's been very helpful for me. I find ChatGPT the easiest to use; not because it's more accurate (it isn't), but because it seems to understand the intent of my questions most clearly. I don't usually have to iterate much.

I use it like a know-it-all personal assistant that I can ask any question to; even [especially] the embarrassing, "stupid" ones.

> The only stupid question is the one we don't ask.

- On an old art teacher's wall

My overall concern has to do with our developer ecosystem from the important points mentioned by simonw and narush. I've been concerned about this for years but AI reliance seems to be pouring jet fuel on the fire. Particularly troubling is the lack of understanding less-experienced devs will have over time. Does anyone have a counter-argument for this they can share on why this is a good thing?

  • The shallow analogy is like "why worry about not being able to do arithmetic without a calculator"? Like... the dev of the future just won't need it.

    I feel like programming has become increasingly specialized and even before AI tool explosion, it's way more possible to be ignorant of an enormous amount of "computing" than it used to be. I feel like a lot of "full stack" developers only understand things to the margin of their frameworks but above and below it they kind of barely know how a computer works or what different wire protocols actually are or what an OS might actually do at a lower level. Let alone the context in which in application sits beyond let's say, a level above a kubernetes pod and a kind of trial-end-error approach to poking at some YAML templates.

    Do we all need to know about processor architectures and microcode and L2 caches and paging and OS distributions and system software and installers and openssl engines and how to make sure you have the one that uses native instructions and TCP packets and envoy and controllers and raft systems and topic partitions and cloud IAM and CDN and DNS? Since that's not the case--nearly everyone has vast areas of ignorance yet still does a bunch of stuff--it's harder to sell the idea that whatever AI tools are doing that we lose skills in will somehow vaguely matter in the future.

    I kind of miss when you had to know a little of everything and it also seemed like "a little bit" was a bigger slice of what there was to know. Now you talk to people who use a different framework in your own language and you feel like you're talking to deep specialists whose concerns you can barely understand the existence of, let alone have an opinion on.

    • > Do we all need to know about processor architectures and microcode and L2 caches and paging and OS distributions and system software…

      Have you used modern software… or just software in general, to be honest?

      We have had orders of magnitude improvement in hardware performance and much fewer orders of magnitude increase in software performance and features.

      May I present the Windows Start menu as a perfect exhibit: we put a web browser in there and made actually finding the software you want to use harder than ever, and even search is completely broken 99% of the time (really, try PowerToys Run or even Windows+S for a night-and-day difference).

      We add boundless complexity to things that don't need it, millions of lines of code, then waste millions of cycles running security tools to heuristically prevent malicious actors from exploiting those millions of lines of code that are impossible to know, because it is deemed too difficult to learn the underlying semantics of the problem domain.

Ed Zitron was 100% right. The mask is off and the AI subprime crisis is coming. Reading TFA, it would be hilarious if the bubble burst AND it turns out there's actually no value to be had, at ANY price. I for one can't wait for this era of hype to end. We'll see.

you're addicted to the FEELING of productivity more than actual productivity. even knowing this, even seeing the data, even acknowledging the complete fuckery of it all, you're still gonna use me. i'm still gonna exist. you're all still gonna pretend this helps because the alternative is admitting you spent billions of dollars on spicy autocomplete.

I don't understand how anyone doing open source can use something trained on other people's code as a tool for contributions.

I wouldn't accept someone's copy and pasted code from another project if it were under an incompatible license, let alone something with unknown origin.

I would love to see a comparison of the pull requests generated by each workflow, if possible. My experience with Copilot has generally been that it suggests far more code than I would actually write to solve a specific problem - sometimes adding extra checks where they aren't needed, sometimes just being more verbose than I would be, and oftentimes repeating itself where it would be better to use an abstraction.

My personal hypothesis is that seeing the LLM write _so much_ code may create the feeling that the problems it is solving would take longer to solve by yourself.

AI by design can only repeat and recombine past material. Therefore actual invention is out.

  • Pretty much all invention is novel combination of known techniques. Anything that introduces a fundamental new technique is usually in the realm of groundbreaking papers and Nobel prizes.

  • It's not a huge deal. I don't need the AI to invent a replacement for the for loop or map function; I only want it to use those tools.

    I'm the one providing the invention, it's transforming my invention into an implementation; sometimes better than others.

  • Is that actually proven?

    • The easiest way to see this for yourself is with an image generator. Try asking for a very specific combination of things that would not normally appear together in an artpiece.

AI sometimes points out hygiene issues that might otherwise be swept under the carpet but, once pointed out, can't be ignored anymore. I know I don't need that error handling, I'm certain, for the near future, but maybe it will be needed... Also, the code produced by the AI has some impedance mismatch with my natural code. Then one needs to figure out whether that is due to moving best practices, until-now-ignored best practices, or the AI being overwhelmed with code from beginners. This all takes time - some of it is transient, some of it is actually improving things, and some of it is waste. The jury is still out.

I'll be curious of the long term impacts of AI.

Such as: do you end up spending more time to find and fix issues, does AI use reduce institutional knowledge, will you be more inclined to start projects over from scratch.

I find agents useful for showing me how to do something I don't already know how to do, but I could see how for tasks I'm an expert on, I'd be faster without an extra thing to have to worry about (the AI).

As someone has been doing hardcore genai for 2+ years, my experience has been, and what we advise internally:

* 3 weeks to transition from ai pairing to AI Delegation to ai multitasking. So work gains are mostly week 3+. That's 120+ hours in, as someone pretty senior here.

* Speedup is the wrong metric. Think throughput, not latency. Some finite amount of work might take longer, but the volume of work should go up because AI can do more on a task and diff tasks/projects in parallel.

Both perspectives seem consistent with the paper description...

  • Have you actually measured this?

    Because one of the big takeaways from this study is that people are bad at predicting and observing their own time spent.

    • yes, I keep prompt plan logs

      At the same time... that's not why I'm comfortable writing this. It's pretty obvious when you know what good vs bad feels like here and adjust accordingly:

      1. Good: You are able to generate a long plan and that plan mostly works. These are big wins _as long as you are multitasking_: you are high throughput, even if the AI is slow. Think running 5-20min at a time for pretty good progress, for just a few minutes of your planning that you'd largely have to do anyways.

      2. Bad: You are wasting a lot of attention chatting (so 1-2min runs) and repairing (re-planning from the top, vs progressing). There is no multitasking win.

      It's pretty clear what situation you're in, with run duration on its own being a ~10X level difference.

      Ex: I'll have ~3 projects going at the same time, and/or whatever else I'm doing. I'm not interacting "much" so I know it's a win. If a project is requiring interaction, well, now I need to jump in, and it's no longer agentic coding IMO, but chat assistant stuff.

      At the same time, I power through case #2 in practice because we're investing in AI automation. We're retooling everything to enable long runs, so we'll still do the "hard" tasks via AI to identify & smooth the bumps. Similar to infrastructure-as-code and SDLC tooling, we're investing in automating as much of our stack as we can, so that means we figure out prompt templates, CI tooling, etc to enable the AI to do these so we can benefit later.

      2 replies →

Essentially an advertisement against Cursor Pro and/or Claude Sonnet 3.5/3.7

Personally, I think when I tried tools like Void IDE, I was fighting with Void too much. It is beta software, it is buggy, but also the big one... the learning curve of the tool.

I haven't had the chance to try Cursor, but I imagine it's going to have a learning curve as a new tool.

So perhaps a slowdown at first is to be expected; but later, after you get your context and prompting down pat and ask specifically for what you want, you get your speedup.

I find myself having discussions with AI about different design possibilities and it sometimes comes up with ideas I hadn't thought of or features I wasn't aware of. I wouldn't classify this as "overuse" as I often find the discussions useful, even if it's just to get my thoughts down. This might be more relevant for larger scoped tasks or ones where the programmer isn't as familiar with certain features or libraries though.

The authors say "High developer familiarity with repositories" is a likely reason for the surprising negative result, so I wonder if this generalizes beyond that.

  • Like if it generalizes to situations where the developer is not familiar with the repo? That doesn’t seem like generalizing, that seems like specifying. Am I wrong in saying that the majority of developer time is spent in repos that they’re familiar with? Every job and project I’ve worked has been on a fixed set of repos the entire time. If AI is only helpful for the first week or two on a project, that’s not very many cases it’s useful for.

    • I'd say I write the majority of my code in areas I'm familiar with, but spend the majority of my _time_ on sections I'm not familiar with, and ai helps a lot more with the latter than the former. I've always felt my coding life is speeding through a hundred lines of easy code then getting stuck on the 101st. Then as I get more experienced that hundred becomes 150, then 200, but always speeding through the easy part until I have to learn something new.

      So I never feel like I'm getting any faster. 90% of my time is still spent in frustration, even when I'm producing twice the code at higher quality

  • Without the familiarity would the work be getting done effectively? What does it mean for someone to commit AI code that they can't fully understand?

I’m not surprised that AI doesn’t help people with 5+ years experience in open source contribution, but I’d imagine most people aren’t claiming AI tools are at senior engineer level yet.

Soon, once the tools and how people use them improve, AI won't be a hindrance for advanced tasks like this, and soon after, AI will be able to do these PRs on its own. It's inevitable given the rate of improvement even since this study.

  • Even for senior levels the claim has been that AI will speed up their coding (take it over) so they can focus on higher level decisions and abstract level concepts. These contributions are not those and based on prior predictions the productivity should have gone up.

    • It would be different I'm sure if they were making contributions to repos they had less familiarity with. In my experience and talking with those who use AI most effectively, it is best leveraged as a way of getting up to speed or creating code for a framework/project you have less familiarity with. The general ratio determining the effectiveness of non-AI coding vs AI coding is the familiarity the user has with the codebase * the complexity of the codebase : the amount of closed-loop abstractions in the tasks the coder needs to carry out.

      Currently AI is like a junior engineer, and if you don't have good experience managing junior engineers, AI isn't going to help you as much.

      1 reply →

One thing I could not find on a cursory read is how used those developers were to AI tools. I would expect someone using them regularly to benefit, while someone who has only played with them a couple of times would likely be slowed down as they deal with the friction of learning to be productive with the tool.

  • In this case though you still wouldn't necessarily know if the AI tools had a positive causal effect. For example, I practically live in Emacs. Take that away and no doubt I would be immensely less effective. That Emacs improves my productivity and without it I am much worse in no way implies that Emacs is better than the alternatives.

    I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing changes. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, one to freely use AI tools, and one prohibited from AI tools. Then teach these developers open source (like a course off of this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then in the end, track a number of metrics such as leadership position in community, coding/non-coding contributions, emotional connection to project, social connections made with community, knowledge of code base, etc.

    Personally, my prior probability is that the no-ai group would likely still be ahead overall.

    • FWIW, LLM tooling for Emacs is great. gptel, for example, allows you to converse with a wide range of different models from anywhere in Emacs — you can spontaneously send requests while typing some text or even browsing the M-x menu. I often do things like "summarize current paragraph in pdf document" or "create a few anki cards based on this web page content", etc.

My hot take: Cursor is a bad tool for agentic coding. Had a subscription and canceled it in favor of Claude Code. I don’t want to spend 90% of my time approving every line the agent wants to write. With Claude Code I review whole diffs - 1-2 minutes of the agent’s work at a time. Then I work with the agent at a level of what its approach is, almost never asking about specific lines of code. I can look at 5 files at once in git diff and then ask “why’d you choose that way?” “Can we undo that and try to find a simpler way?”

Cursor’s workflow exposes how differently different people track context. The best ways to work with Cursor may simply not work for some of us.

If Cursor isn’t working for you, I strongly encourage you to try CLI agents like Claude Code.

Very cool work! And I love the nuance in your methodology and findings. Anyway, I'm preparing myself for all the "Bombshell news: AI is slowing down developers" posts that are coming.

LLMs are god-tier if you know what you're doing and prompt them with “do X”, where X is a SELF-CONTAINED change you would know how to implement manually.

For example, today I asked Claude to implement per-user rate limiting in my NestJS service, then iterated by asking it to implement specific unit tests and do some refactoring. It one-shot everything. I would say 90% time savings.
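
A rough sketch of the shape of that change (not my exact code; I'm assuming the @nestjs/throttler package here, and method signatures vary a bit between versions):

    // Sketch: key throttling by authenticated user instead of client IP.
    import { Injectable } from '@nestjs/common';
    import { ThrottlerGuard } from '@nestjs/throttler';

    @Injectable()
    export class UserThrottlerGuard extends ThrottlerGuard {
      // Override the tracker so limits apply per user, falling back to IP
      // for anonymous requests. (Exact signature depends on throttler version.)
      protected async getTracker(req: Record<string, any>): Promise<string> {
        return req.user?.id ?? req.ip;
      }
    }

Wire it up through the usual guard mechanisms (a global APP_GUARD provider or @UseGuards on specific controllers) and the configured limits apply per user rather than per IP.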

Unskilled people ask them “I have giant problem X, solve it” and end up with slop.

  • I tried exactly that, several times, over and over.

    Except on "hello world" situations (which I guess is a solid part of the corpus LLMs are trained with) these tools were consistently slower.

    Last time was an area where several files were subtly different in a section that essentially does about the same thing, and needed to be aligned and made consistent†.

    Time to - begrudgingly - do it manually: 5min

    Time to come up with a one-shot shell incantation: 10min

    Time to very dumbly manually mark the areas with ===BEGIN=== and ===END=== and come up with a one-shot shell incantation: 3min

    Time to do it for the LLM: 45min††; also it required regular petting every 20ish command so zero chance of letting it run and doing something else†††.

    Time to review + manually fix the LLM output which missed two sections, left obsolete comments, and modified four files that were entirely unrelated yet clearly declared as out of scope in the prompt: 5min

    Consistently, proponents have been telling me "yeah you need to practice more, I'm getting fine results so you're holding it wrong, we can do a session together and I'll show you how to do it", which they do, and then it doesn't work, and they're like "well I'll look into it and circle back" and I never hear from them again.

    As for suggestions, for every good completion where I accept saying "oh well, why not", 99 get rejected: the majority are complete hallucinations absolutely unrelated to the surrounding logic, a third are either broken or introduce non-working code, and 1-5 _actively dangerous_ in some way.

    The only places where I found LLMs vaguely useful are:

    - Asking questions about an unknown codebase. It still hallucinates and misdirects or is excessively repetitive about some things (even with rules) but it can crudely draw a rough "map" and make non-obvious connections about two distant areas, which can be welcome.

    - Asking for a quick code review in addition to the one I ask to humans; 70% of such output is laughably useless (although harmless beyond the noise + energy cost), 30% is duplicate of human reviews but I can get it earlier, and sometimes it unearths a good point that has been overlooked.

    † No, the specific section cannot+should not be factored out

    †† And that's because I interrupted it because it was going about modifying files that it should not have.

    ††† A bit of a lie because I did the other three ways during that time. Which also is telling because the time to do the other ways would actually be _lower_ because I was interrupted by / had to keep tabs on what the AI agent was doing.

My theory is that, outside of programming skill, amazement by AI tools is inversely proportional to typing/navigating speed.

I already know what I need to write, I just need to get it into the editor. I wouldn’t trade the precision I have with vim macros flying across multiple files for an AI workflow.

I do think AI is a good rubber ducky sometimes tho, but I despise letting it take over editing files.

Would you be worse without it? Now prepare to pay $1000+/month for ChatGPT in a few years when the dust settles.

What if the slowdown isn't a bug but a feature? What if AI tools are forcing developers to think more carefully about their code, making them slower but potentially producing better results? AFAIK the study measured speed, not quality, maintainability, or correctness.

The developers might feel more productive because they're engaging with their code at a higher level of abstraction, even if it takes longer. This would be consistent with why they maintained positive perceptions despite the slowdown.

What is interesting here is that all predictions were positive, but results are negative.

This shows that everyone in the study (economic experts, ML experts and even developers themselves, even after getting experience) are novices if we look at them from the Dunning-Kruger effect [1] perspective.

[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."

  • > "The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."

    No, they underestimated their own abilities for the most part; the estimates for AI-disallowed tasks were all undershot in terms of real implementation time.

    What they overestimated was the ability of LLMs to provide real productivity gains on a given task.

    • > What they overestimated was the ability of LLMs to provide real productivity gains on a given task.

      This is exactly my point.

      This is not about developers overestimating the ability of the LLM; it is about developers themselves, economic experts, and ML experts overestimating the ability of a developer interacting with an LLM.

      LLMs are not "able" per se, they are "prompted" to be "able." They are not agents, but behave as agents on someone's behalf - and no one has a clue whether the use of LLMs is positive or detrimental, with the bias being "LLM use is net positive."

      The overestimation of LLM's abilities by everyone calls for Dunning-Kruger.

This does not take into account the fact that experienced developers working with AI have shifted into roles of management and triage, working on several tasks simultaneously.

Would be interesting (and in fact necessary to derive conclusions from this study) to see the aggregate number of tasks completed per developer with AI augmentation. That is, if time per task has gone up by 20% but we clear 2x as many tasks, that is a pretty important caveat to the results published here.

I really like those graphics; does anyone know what tool was used to create them?

  • The graphs are all matplotlib. The methodology figure is built in Figma! (Source: I'm a paper author :)).

N = 16 developers. Is this enough to draw any meaningful conclusions?

  • That depends on the size of the effect you’re trying to measure. If cursor provides a 5x, 10x, or 100x productivity boost as many people are claiming, you’d expect to see that in a sample size of 16 unless there’s something seriously wrong with your sample selection.

    If you are looking for a 0.1% increase in productivity, then 16 is too small.

    • Well it depends on the variance of the random variable itself. You're right that with big, obvious effects, a larger n is less "necessary". I could see individuals having very different "productivities", especially when the idea is flattened down to completion time.

      1 reply →

    • “A quarter of the participants saw increased performance, 3/4 saw reduced performance.” So I think any conclusions drawn on these 16 people doesn’t signify much one way or the other. Cool paper but how is this anything other than a null finding?

      1 reply →

Early 2025. I imagine the results would be quite different with mid 2025 models and tools.

  • If they used mid 2025 models and tools, the paper would have come out in late 2025, and you would have had the same complaint.

I wonder if the discrepancy is that it felt like it was taking less time because they had to do less thinking, which feels easier and hence faster.

Even so... I still would be really surprised if there wasn't some systematic error here skewing the results, like the developers deliberately picked "easy" tasks that they already knew how to do, so implementing them themselves was particularly fast.

Seems like the authors had about as good a methodology as you can get for something like this. It's just really hard to test stuff like this. I've seen studies proving that code comments don't matter, for example... are you going to stop writing comments? No.

  • > which feels like it is easier and hence faster.

    We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect, some developers seem to think so, and others don't!

    > like the developers deliberately picked "easy" tasks that they already knew how to do

    We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...

    [1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

> We pay developers $150/hr as compensation for their participation in the study.

Can someone point me to these 300k/yr jobs?

> The developers estimated how long it would take them to complete each task (a) under normal conditions

Ah, there’s your issue. There’s not a developer in human history who hasn’t drastically underestimated how long it would take to complete a task.

  • On average, the developers overestimated how long tasks would take when not using AI: their actual times came in under their estimates. The opposite happened with AI-assisted tasks.

    The conclusion isn't that "estimates are hard" (they can be), but rather that AI-assistance can lead people to believe they're being more productive than they actually are, because they incorrectly think they've spent less time.

    The graphs in the paper tell part of that story; the time that is being reduced is in actual programming time, "Reading & Searching", "Testing & Debugging", but that time is being spent elsewhere, notably in parts specific to LLMs (reviewing output, prompting, waiting for the AI to spit out results).


Hey guys, why are we making it so complicated? Do we really need a paper and a study?

anyway - AI as the tech currently stands is a new skill to use and takes us humans time to learn, but once we learn it well, it becomes a force multiplier

ie see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...

  • Because for now, that's just what those financially profiting from the AI-hype tell us. Be it sama, hyung or nadella, they all profit if people _believe_ AI is a force multiplier for everybody. Reality is much more muddy though and it's absolutely not as obvious as those people claim.

    And keep in mind that a 5-10x price hike is to be expected if those companies keep spending billions to make millions.

    Right now, there is a consistent stream of papers incoming which indicates that AI might be much more of a specialized tool for very particular situations instead of the "solve everything"-tool the hype makes people believe. This is highly significant.

    "Just believe me bro" is just not enough.

Any time you see the word "measuring" in the context of software development, you know what follows will be nonsense and probably in service of someone's business model.