Really a great piece of work, quite the opposite of the usual studies posted here.
From the headline you might easily guess that the study must be flawed: the sample not representative, the developers not expert enough with the AI, and so on. But then they give a very well done list of all the valid objections, with arguments for why they don't think those contradict the study.
In the end, that answered every question I could have had.
I think the dichotomy you see in how positive people are about AI has almost entirely to do with the kind of questions they ask.
That seems obvious, but a consequence is that people who are sceptical of AI (like me) only use it when they've exhausted other resources (like Google). You ask very specific questions where not a lot of documentation is available and inevitably even o3 ends up being pretty useless.
Conversely, there are people who love AI and use it for everything, and since the majority of the stuff they ask about is fairly simple and well documented (e.g. "Write me some typescript"), they rarely have a negative experience.
- Some people simply ask a lot more questions than others (regardless of whether they like or dislike AI). That is, some people simply prefer to find things out by themselves, and so treat an AI, like Google or Stack Overflow, as a last resort. Their questions to an AI will therefore likely be more complicated, because they have already figured out the easy parts on their own.
- If I have to make the effort to explain to the AI, in sufficiently exhaustive detail, what I need (which I often have to do), I expect the answers of the AI to be really good. If they aren't, having explained my problem to the AI was simply a waste of time.
> I expect the answers of the AI to be really good. If they aren't, having explained my problem to the AI was simply a waste of time.
I find the worst part to be when it doesn't correct flaws in my assumptions.
For example, yesterday I asked it "what is the difference between these two Datadog queries?" It replied with something semi-correct, but it didn't discover the fundamental flaw: the first one wasn't a valid query at all because of unbalanced parens. In fact, it turns out that the two strings (plus another one) get concatenated, and only then do they form a valid query.
A simple "the first string is not a valid query because of a missing closing paren" would have saved a lot of time spent trying to understand this, and I suspect that's what I would have received had I prompted it with "what's the problem with this query?". But LLMs are just too sycophantic to help with these things.
I don't think that dichotomy is true at all, at least not with experienced software people.
Many folks I know are skeptical of the hype, or maybe fully anti/distrustful, for reasons I think are valid. But many of those same people have tried LLM tools, maybe ChatGPT or Copilot or Cursor, and recognize the value even with huge misgivings. Some have gone further with tools like Claude Code and seen the real potential there, quite a step beyond fancy auto-complete or just-in-time agents... but even there you can end up in rabbit holes, drowning in horrible design.
On your incredibly reductive scale, I'm closer to 'love' than 'skeptical', but I'm often a lot of both. But I'd never write a prompt like 'write me some typescript' for any real work, or honestly anything close to that, unless it's just for memes or demonstrations.
But no one who programs for a living uses prompts like that, at least not for real work. That is just silly talk.
I obviously don't mean that people literally write "write me some typescript", because nobody wants code that does something arbitrary.
I'm also not saying that every reaction to AI falls between love and skepticism: I wrote a three-sentence comment on a complex topic to sketch out an idea.
The tone of your comment suggests that my comment upset you, which wasn't my intent. But you have to try to be a little generous when you read other people's stuff, or these discussions will get very tedious very quickly.
I use it very routinely to generate TikZ diagrams. The output is often obviously wrong and I need to tweak it a little by hand. But the hardest part is usually getting something working in the first place, and at that the AI is first class. It gets me 90% of the way there, and the rest is me.
I think you touched on an important aspect, but did not explore it further.
If we accept that AI is a tool, then the problem is the nature of the tool, as it will vary heavily from individual to individual. This partially accounts for the ridiculous differences between the self-reported accounts of people who use it on a regular basis.
And then there is the possibility that my questions are not that unusual and/or are well documented (quite possible), so my perception of the usefulness of the answers is skewed.
My recent interaction with 4o was pretty decent on a very new (by industry standards) development, and while documentation for it exists, it is a swirling vortex of insanity from where I sit. I was actually amazed to see how easily 4o spotted some of those discrepancies and listed them for me, along with the likely pitfalls that may come with them. We will be able to find out whether that prediction holds very soon.
The thing about tools is that they need to be predictable. I can't remember the source, but it's a concept I read that really stuck with me. A predictable tool can be used skillfully and accurately because the user can anticipate how it works and deploy it effectively. It will always be aligned with the user intent because the user decides how and when it is used.
A tool that constantly adapts to how it is used will frequently be misaligned with user intent. Language models constantly change their own behavior based on the specific phrasing you gave it, the context you deployed it in, and the inherent randomness in token generation. Its capacity to be used as a tool will be inherently limited by this unpredictability.
Well, I use it before Google, since it generally summarizes the web pages and strips out the ads. Quite handy.
It's also very useful for checking whether you understand something correctly. And for programming specifically, I've found it really useful for help naming stuff (which tends to be hard, not least because it's subjective).
> You ask very specific questions where not a lot of documentation is available and inevitably even o3 ends up being pretty useless.
Do you have any example questions where o3 failed to be helpful?
I use it pretty similarly to you, basically only resorting to it to unblock myself; otherwise I'm mostly the one doing the actual work, with LLMs helping on specific functions, specific blockers, or exploring new "spaces". But almost every time I've gotten stuck, o3 (and o3-pro mode) has managed to unstick me, once I figured out the right way to ask the question, even when my own searching and reading didn't help.
I had to create a Cython module wrapping some C, used Claude 4 and GPT 4.1, they were worse than useless. One can imagine why I needed help with that project.
I am personally somewhere in-between these two places. I've used ChatGPT to get unstuck a few times this past week because I was at the end of my rope with regards to some GPU crashes that I couldn't make heads or tails of. I then used it for less headache-inducing things and overall it's been an interesting experience.
For research I'm enjoying asking ChatGPT to annotate its responses with sources and reading those; in some cases I've found SIGGRAPH papers that I wouldn't have stumbled upon otherwise, and it's nice to get them all in a response.
ChatGPT (4o, if it's of any interest) is very knowledgeable about DirectX12 (which we switched to just this week) and I've gained tons of peripheral knowledge with regards to the things I've been battling with, but only one out of four times has it been able to actually diagnose directly what the issue was; three separate times it's been something it didn't really bring up or note in any meaningful regard. What helped was really just me writing about it, thinking about everything around it and for that it's been very helpful.
Realistically, if someone let an agent running on this stuff loose on our code base it would likely end up wasting days of time and still not fix the issue. Even worse, the results would have to be tested on a specific GPU to even trigger the issue to begin with.
It seems to me that fancy auto-complete is still likely the best this can do, and I actually like it for that. I don't use LLM-assisted auto-complete anymore, but I used to use GitHub Copilot back in 2022 and it was more productive than my brief tests of agents.
If I were to regularly use LLMs for actual programming, it would most likely be just for tab-completion of "rest of expression" or one line at a time, but probably with local LLMs.
It's kind of true. I only use it for simple stuff that I don't have time for, like writing a simple diagram in TikZ. The AI does the simple busywork of providing a good-enough approximation, which I can then tweak into what I want.
For hard questions, I prefer to use my own skills, because AI often regurgitates things I'm already aware of. I still ask AI on the off chance it comes up with something cool, but most often I have to do it myself.
What bothers me more than any of this particular discussion is that we seem to have been incapable of determining programmer productivity in a meaningful way since my debut as a programmer 40 years ago.
I’m confused as to why anyone would think this would be possible to determine.
Like can we determine the productivity of doctors, lawyers, journalists, or pastry chefs?
What job out there is so simple that we can meaningfully measure all the positive and negative effects of the worker, as well as account for different conditions between workers?
I could probably get behind the idea that you could measure productivity for professional poker players (given a long enough evaluation period). Hard to think of much else.
People in charge love to measure productivity and, just as harmfully, performance. The main insight people running large organisations (big business and governments) have into how they are doing is metrics, so they will use what measures they can have regardless of how meaningful they are.
The British government (probably not any worse than anyone else, just what I am most familiar with) does measure the productivity of the NHS: https://www.england.nhs.uk/long-read/nhs-productivity/ (including doctors, obviously).
They also try to measure the performance of teachers and schools, and introduced performance league tables and special exams (SATS, exams sat at various ages by school children in the state system, nothing like the American exams with the same name) to do this more pervasively. They made it better by creating multi-academy trusts, which add a layer of management running multiple schools, so even more people want even more metrics.
The same for police, and pretty much everything else.
Yet paradoxically, the user knows instinctively. I know exactly when I'll get my next medical checkup, and when the test results will arrive. I know if a software app improves my work, and what it will cost to get a paid license.
The hard thing is occupations where the quantity of effort is unrelated to the result due to the vast number of confounding factors.
We can determine the productivity of factory workers, and that is still(!) how we are seen by some managers.
And to be fair, some CRUD work is repetitive enough that it should be possible to get a fair measure of at least the difference in speed between developers.
But the fact that building simple CRUD services with REST interfaces takes as much time as it does is a failure of the tools we use.
> Like can we determine the productivity of doctors, lawyers, journalists, or pastry chefs?
Yes, yes we can.
Programmers really need to stop this cope about us being such special snowflakes that we can't be assessed and that our managers just need to take it on good faith that we're worth keeping around.
Part of that may be what we measure "product" to be.
My entire life, I have written “ship” software. It’s been pretty easy to say what my “product” is.
But I have also worked at a fairly small scale, in very small teams (often, only me). I was paid to manage a team, but it was a fairly small team, with highly measurable output. Personally, I have been writing software as free, open-source stuff, and it was easy to measure.
Some time ago, someone posted a story about how most software engineers have hardly ever actually shipped anything. I can’t even imagine that. I would find that incredibly depressing.
It would also make productivity pretty hard to measure. If I spent six months working on something that never made it out of the crèche, would that mean all my work was for nothing?
Also, really experienced engineers write a lot less code (that does a lot more). They may spend four hours, writing a highly efficient 20-line method, while a less-experienced engineer might write a passable 100-line method in a couple of hours. The experienced engineers’ work might be “one and done,” never needing revision, while the less-experienced engineer’s work is a slow bug farm (loaded with million-dollar security vulnerability tech debt), which means that the productivity is actually deferred, for the more experienced engineer. Their manager may like the less-experienced engineer's work, because they make a lot more noise, doing it, are "faster," and give MOAR LINES. The "down-the-road" tech debt is of no concern to the manager.
I worked for a company that held the engineer Accountable even if the issue appeared two years after shipping. It encouraged engineers to do their homework, and each team had a dedicated testing section to ensure that they didn't ship bugs.
When I ask ChatGPT (for example) for a code solution, I find that it’s usually quite “naive” (pretty prolix). I usually end up rewriting it. That doesn’t mean that’s a bad thing, though. It gives me a useful “starting point,” and can save me several hours of experimenting.
> When I ask ChatGPT (for example) for a code solution, I find that it’s usually quite “naive” (pretty prolix). I usually end up rewriting it. That doesn’t mean that’s a bad thing, though. It gives me a useful “starting point,” and can save me several hours of experimenting.
The usual counter-point is that if you (commonly) write code by experimenting, you are doing it wrong. Better to think the problem through, and then write decent code (that you finally turn into great code). If the code you start with is as "naive" as you describe, in my experience it is nearly always better to throw it away (you can't make gold out of shit) and completely start over, i.e. think the problem through and then write decent code.
But nevertheless, productivity objectively exists. Some people/teams are more productive than others.
I suppose it would be simpler to compare productivity for people working on standard, "normalized" tasks, but often every other task a programmer is assigned is something different to the previous one, and different developers get different tasks.
It's difficult to measure productivity based on real-world work, but we can create an artificial experiment: give N programmers the same M "normal", everyday tasks and observe whether those using AI tools complete them more quickly.
This is somewhat similar to athletic competitions — artificial in nature, yet widely accepted as a way to compare runners’ performance.
We can determine productivity for the purpose of studies like this. Give a bunch of developers the exact same task and measure how quickly they can produce a defect-free solution. Unfortunately, this study didn’t do that – the developers chose their own tasks.
Is there any AI that can create a defect-free solution to non-trivial programming problems without supervision? This has never worked in any of my tests, so I suspect the answer is currently No.
what about the $ you make? isn't that an indicator? you've probably made more than me, so you are more successful while both of us might be doing the same thing.
Salary is an indirect and only partially useful metric, but one could argue that your ability to self-promote matters more, at least in the USA. I worked at Microsoft and saw that some of the people who made fat stacks of cash just happened to be in the right place at the right time, or promoted things that looked good but were not good for the company itself.
I made great money running my own businesses, but the vast majority of the programming was by people I hired. I’m a decent talent, but that gave me the ability to hire better ones than me.
what about the $ you generate? im a software developer consultant. we charge by the hour. up front, time and materials, and/or support hours. not too many leaps of logic to see there is a downside to completing a task too quickly or too well
i have to bill my clients and have documented around 3 weeks of development time saved by using LLMs to port other client systems to our system since December. on one hand this means we should probably update our cost estimates, but im not management so for the time ive decided to use the saved time to overdeliver on quality
eventually clients might get wise and not want to overdeliver on quality and we would charge less according to time saved by LLMs. despite a measured increase in "productivity" i would be generating less $ because my overall billable hour % decreases
hopefully overdelivering now reduces tech debt to reduce overhead and introduces new features which can increase our client pipeline to offset the eventual shift in what we charge our clients. thats about all the agency i can find in this situation
It is from a certain point of view. For example, at a national level productivity is measured in GDP per hour worked. Even this is problematic: it means you can increase productivity by reducing working hours or by making low-paid workers unemployed.
On the other hand, it makes no sense from some points of view. For example, if you get a pay rise, that does not mean you are more productive.
In a vacuum I don’t believe pay alone is a very good indicator. What might be a better one is if someone has a history across their career of delivering working products to spec, doing this across companies and with increasing responsibility. This of course can only be determined after the fact.
Probably not. I took a new job at significantly reduced pay because it makes me feel better and reduces stress. The fact that I can allow myself to work for less seems to me a sign that I'm more successful.
Productivity has zero to do with salary. Case in point: FOSS.
Some of the most productive devs don't get paid by the big corps who make use of their open-source projects, hence the constant urging for corps and individuals to sponsor the projects they make money from.
People doing charity work, work for non-profits or work for public benefit corporations typically have vastly lower wages than those who work in e.g high frequency trading or other capital-adjacent industries. Are you comfortable declaring that the former is always vastly less productive than the latter?
Changing jobs typically brings a higher salary than your previous job. Are you saying that I'm significantly more productive right after changing jobs than right before?
I recently moved from being employed by a company to do software development, to running my own software development company and doing consulting work for others. I can now put in significantly fewer hours, doing the same kind of work (sometimes even on the same projects that I worked on before), and make more money. Am I now significantly more productive? I don't feel more productive, I just learned to charge more for my time.
IMO, your suggestion falls on its own ridiculousness.
Team members always know who is productive and who isn’t, but generally don’t snitch to the management because it will be used against them or cause conflicts with colleagues.
This team-level productivity doesn’t necessarily translate into something positive for a company.
Management is forced to rely on various metrics which are gamed or inaccurate.
I find the swings to be wild: when you win with it, you win really big, but when you lose with it, it's a real bite out of your week too. And I think 10x to 20x has to be figurative, right? You can do 20x by volume, maybe, but to borrow an expression from Steve Ballmer, that's like measuring an airplane by kilograms.
Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious? Even a machine that just handed you the perfect code can't 20x your real output, even if it gave you the source file at 20x your native sophistication you wouldn't be able to build and deploy it, let alone make changes to it.
But even if it's only the last 5-20%, getting that when you're already operating at your very limit and trying to hit that limit every single day is massive; it makes a bunch of stuff on the bubble go from "not realistic" to "we did that".
There are definitely swings. Last night it took about 2 hours to get Monaco into my webpack-built Bootstrap template; it came down to CSS being mishandled and Claude couldn't see the light. I just pasted the code into ChatGPT o3 and it fixed it on the first try. I pasted the output of ChatGPT back into Claude and voilà, all done.
A key skill is to sense when the AI is starting to guess for solutions (no different to human devs) and then either lean into another AI or reset context and start over.
I'm finding the code quality increases greatly with the addition of the text 'and please follow best practices because will be pen tested on this!', and wow, it takes it much more seriously.
> Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious?
How much of the code you write is actually like this? I work in the domain of data modeling; for me, once the math is worked out, the majority of the code is "trivial". The kind of code you are talking about is maybe 20% of my time, honestly also the most enjoyable 20%. I would be very happy if that were all I worked on, with the rest done by AI.
> Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious?
When you zoom in, even this kind of work isn't uniform - a lot of it is still shaving yaks, boring chores, and tasks that are hard dependencies for the work that is truly cognitively demanding, but themselves are easy(ish) annoyances. It's those subtasks - and the extra burden of mentally keeping track of them - that sets the limit of what even the most skilled, productive engineer can do. Offloading some of that to AI lets one free some mental capacity for work that actually benefits from that.
> Even a machine that just handed you the perfect code can't 20x your real output, even if it gave you the source file at 20x your native sophistication you wouldn't be able to build and deploy it, let alone make changes to it.
Not true if you use it right.
You're probably following the "grug developer" philosophy, as it's popular these days (as well as "but think of the juniors!", which is the perceived ideal in the current zeitgeist). By design, this turns coding into boring, low-cognitive-load work. Reviewing such code is, thus, easier (and less demoralizing) than writing it.
20x is probably a bit much across the board, but for the technical part, I can believe it: there's too much unavoidable but trivial bullshit involved in software these days (build scripts, Dockerfiles, IaaS). Preventing deep context switching on those is a big time saver.
I agree that I feel more productive. AI tools do actually make things easier and make my brain use less energy. You would think that would mean more productivity, but maybe it just feels that way.
Stage magicians say that the magic is done in the audience's memory after the trick is over. It's the effect of the activity.
AI coding tools makes developers happier and able to spend more brain power on actually difficult things. But overall perhaps the amount of work isn't in orders of magnitudes it just feels like it.
Waze, the navigation app, routes you along non-standard routes so that you are not stuck in traffic, so it feels like you are making fast progress. But the time taken may be longer and the distance travelled may be further!
Being in stuck traffic and not moving even for a little bit makes you feel that time has stopped, it's boring and frustrating. Now developers need never be stuck. Their roads will be clear, but they may take longer routes.
We get little boosts of dopamine using AI tools to do stuff. Perhaps we used these signals as indicators of productivity "Ahh that days work felt good, I did a lot"
> Waze, the navigation app, routes you along non-standard routes so that you are not stuck in traffic, so it feels like you are making fast progress. But the time taken may be longer and the distance travelled may be further!
You're not "stuck in traffic", you are the traffic. If the app distributes users around and this means they don't end up in traffic jams, it's effectively preventing traffic jams from forming.
I liked your washing machine vs. sink example that I see you just edited out. The machine may do it slower and less efficiently than you'd do it in the sink, but the machine runs in parallel, freeing you to do something else. So it is with good use of LLMs.
I can't help but note that in 99% of cases this "difficult things" trope makes little sense. In most jobs, the freed-up time is either spent on other stupid tasks, or lost to org inefficiencies, or simply procrastinated away.
Where I have found Claude most helpful is on problems with very specific knowledge requirements.
Like: why isn't this working? Here, Claude, read this 90-page PDF and tell me where I went wrong interfacing with this SDK.
Ohh, I accidentally passed async_context_background_threading_safe instead of async_context_thread_safe_poll and so now it's panicking. Wow, that would have taken me forever.
I wish I could. Some problems are difficult to solve and I still need to pay the bills.
So I work 8 hours a day (to get money to eat) and code another 4 hours at home at night.
Weekends are both 10 hour days, and then rinse / repeat.
Unfortunately some projects are just hard to do and until now, they were too hard to attempt to solve solo. But with AI assistance, I am literally moving mountains.
The project may still be a failure but at least it will fail faster, no different to the pre-AI days.
I bet a co-worker that a migration from Angular 15 to Angular 19 could be done really fast, avoiding months of work. I spent a whole evening on it, and Claude Code was never able to pull off even the migration from 15 to 16 on its own. A total waste of time, and nothing worked. To my surprise, it cost me $275 for nothing. So maybe for greenfield projects it's smooth and saves time, but it's not a silver bullet on projects with problems.
I cringe when I see these numbers. 20 times better means that you can accomplish in two months what would otherwise take 4 years, which is ridiculous when said out loud. We can make it even more ridiculous by pointing out that you would do in 3 years the work of a whole working lifetime (60 years).
I am wondering: on what sort of tasks are you seeing this 20x boost?
I scoped out a body of work and even with the AI assisting on building cards and feature documentation, it came to about 2 to 4 weeks to implement.
It was done in 2 days.
The key I've found with working as fast as possible is to have planning sessions with Claude Code and make it challenge you and ask tons of questions. Then get it to break the work into 'cards' (think Jira, but they are just .md files in your repo) and then maintain a todo.md and done.md file pair that sorts and organizes work flow.
Then start a new context, tell it to review todo.md and pick up next task, and burn through it, when done, commit and update todo.md and done.md, /compact and you're off on the next.
It's more than AI hinting at what to do, it's a whole new way of working with rigor and structure around it. Then you just focus fire on the next card, and the next, and if you ever think up new features, then card it up and put it in the work queue.
You are extrapolating over years as if a programmer’s task list is consistent.
Claude code has made bootstrapping a new project, searching for API docs, troubleshooting, summarizing code, finding a GitHub project, building unit tests, refactoring, etc easily 20x faster.
It’s the context switching that is EXTREMELY expensive for a person, but costless for the LLM. I can focus on strategy (planning features) instead of being bogged down in lots of tactics (code warnings, syntax errors).
Claude Code is amazing, but the 20x gains aren’t evenly distributed. There are some projects that are too specialized (obscure languages, repos larger than the LLM’s context window, concepts that aren’t directly applicable to any codebase in their training corpus, etc). But for those of us using common languages and commodity projects, it’s a massive force multiplier.
I built my second iOS app (Swift) in about 3 days x 8 hours of vibe coding. A vocab practice app with adjustable learning profile, 3 different testing mechanisms, gamification (awards, badges), iOS notifications, text to speech, etc. My first iOS app was smaller, mostly a fork of another app, and took me 4 weeks of long days. 20x speed up with Claude Code is realistic.
And it saves even more time when researching + planning which features to add.
It isn’t ridiculous, it’s easily true, especially when you’re experienced in general, but have little to no knowledge of this particular big piece of tech, like say you’ve stopped doing frontend when jquery was all there was and you’re coming back. I’m doing things with react in hours I would have no business doing in weeks a couple years ago.
So were the people taking part in the study. Which is why we run these studies: to understand where our understanding of ourselves is lacking.
Maybe you are special and do get extra gains. Or maybe you are as wrong about yourself as everyone else and are overestimating the gains you think you have.
I don't know the PDF.js library. Writing both the client- and server-side for a PDF annotation editor would have taken 60 hours, maybe more. Instead, a combination of Copilot, DeepSeek, Claude, and Gemini yielded a working prototype in under 6 hours.
As others probably have experienced, I can only add that I am doing coding now I would have kicked down the road if I did not have LLM assistance.
Example: using LeafletJS — not hard, but I didn't want to have to search all over to figure out how to use it.
Example: other web page development requiring dropping image files, complicated scrolling, split-views, etc.
In short, there are projects I have put off in the past but eagerly begin now that LLMs are there to guide me. It's difficult to compare times and productivity in cases like that.
This is pretty similar to my own experience using LLMs as a tool.
When I'm working with platforms/languages/frameworks I am already deeply familiar with I don't think they save me much time at all. When I've tried to use them in this context they seem to save me a bunch of time in some situations, but also cost me a bunch of time in others resulting in basically a wash as far as time saved goes.
And for me a wash isn't worth the long-term cost of losing touch with the code by not being the one to have crafted it.
But when it comes to environments I'm not intimately familiar with they can provide a very easy on-ramp that is a much more pleasant experience than trying to figure things out through often iffy technical documentation or code samples.
The Leaflet doc is a single-page document with examples you can copy-paste. There is page navigation at the top. Also, Ctrl/Cmd+F plus a keyword seems quicker than writing the prompt.
Nice. I'm afraid I simply assumed, like other "frameworks", it was going to entail wandering all over StackOverflow, etc.
Still, when I simply told Claude that I wanted the pins to group together when zoomed out — it immediately knew I meant "clustering" and added the proper import to the top of the HTML file ... got it done.
I think this for me is the most worrying: "You can see that for AI Allowed tasks, developers spent less time researching and writing code".
My analogy to this is watching people spend their time figuring out how to change colors and draw shapes in PowerPoint, rather than focusing on the content of the presentation. So here we have developers focusing their efforts on correcting the AI output, rather than doing the research and improving their ability to deliver code in the future.
I find I’m most likely to use an LLM to generate code in certain specific scenarios: (i) times I’m suffering from “writer’s block” or “having trouble getting started”; (ii) a language or framework I don’t normally use; (iii) feeling tired/burnt out/demotivated
When I’m in the “zone” I wouldn’t go near an LLM, but when I’ve fallen out of the “zone” they can be useful tools in getting me back into it, or just finishing that one extra thing before signing off for the day
I think the right answer to “does LLM use help or hinder developer productivity” is “it depends on how you use them”
It can get you over some mental blocks; having some code to look at can kick off the idea process even if it's wrong (just as with writing). I don't think it's bad, just as I don't think writing throwaway code for prototyping is a bad way to start a project you aren't sure how to tackle. Waterfall (lots of research and design up front) is still not going to work even if you forgo AI.
> To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years. Developers provide lists of real issues (246 total) that would be valuable to the repository—bug fixes, features, and refactors that would normally be part of their regular work. Then, we randomly assign each issue to either allow or disallow use of AI while working on the issue. When AI is allowed, developers can use any tools they choose (primarily Cursor Pro with Claude 3.5/3.7 Sonnet—frontier models at the time of the study); when disallowed, they work without generative AI assistance. Developers complete these tasks (which average two hours each) while recording their screens, then self-report the total implementation time they needed. We pay developers $150/hr as compensation for their participation in the study.
So it's a small sample size of 16 developers. And it sounds like different tasks were (randomly) assigned to the no-AI and with-AI groups - so the control group doesn't have the same tasks as the experimental group. I think this could lead to some pretty noisy data.
Interestingly, small sample size isn't in the list of objections that the author addresses under "Addressing Every Objection You Thought Of, And Some You Didn't".
I do think it's an interesting study. But would want to see if the results could be reproduced before reading into it too much.
I think the productivity gains most people rave about are stuff like, I wanted to do X which isn't hard if you are experienced with library Y and library Y is pretty popular and the LLM did it perfectly first try!
I think that's where you get 10-20x. When you're working on niche stuff it's either not gonna work or work poorly.
For example, right now I need to figure out why an ffmpeg filter doesn't do X thing smoothly, even though the C code for the filter is tiny and self-contained. Gemini refuses to add comments to the code. It just apologizes for not being able to add comments to 150 lines of code, lol.
However, for building an ffmpeg pipeline in Python I was dumbfounded by how fast I was prototyping stuff and building fairly complex filter chains. If I'd had to do it by hand, just by reading the docs, it would have taken a whole lot more time, effort and frustration, but it was a joy to figure out with Gemini.
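To give a flavor, the kind of chain it helped me put together in minutes looked roughly like this (a minimal sketch assuming the ffmpeg-python package; filenames and filter parameters are made up):

    import ffmpeg  # the ffmpeg-python bindings: pip install ffmpeg-python

    # Hypothetical pipeline: read the first 10 seconds, scale to 720p,
    # apply a 1-second fade-in, and write the result out.
    (
        ffmpeg
        .input("input.mp4", t=10)                 # -t 10 on the input
        .filter("scale", 1280, 720)               # resize
        .filter("fade", type="in", duration=1)    # fade-in
        .output("output.mp4")
        .run()
    )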
So going back to the study: IMO it's flawed, because by definition working on new features for open-source projects wouldn't be the bread and butter of LLMs. But most people aren't working on stuff like this; they're rewriting the same code that 10,000 other people have written, just with their own tiny little twist or whatever.
I really think they excel at greenfield work, and are “fine” at writing code for existing systems. When you are unfamiliar with a library or a pattern it’s a huge time saver.
I agree with that, but on the other hand surely the number of developers matters here? For example, if instead of 16 developers the study had consisted of a single developer completing all 246 tasks with or without AI, and comparing the observed completion times, I think most people would question the reproducibility and relevance of the study.
Whilst my recent experience possibly agrees with the findings, I came here to moan about the methods. Whether it's 16 or 246, that's still a miserably small sample size.
One thing I find frustrating with these conversations is the _strict_ focus on single-task productivity.
Arguably, on a single coding task, I don't really move that much faster. However, I have much, much more brain capacity left both while coding and when I'm done coding.
This has two knock on effects:
1. Most simply, I'm productive for longer. Since LLMs are doing a lot of the heavy lifting, my brain doesn't have to constantly think. This is especially important in time periods where I'd previously have too little mental energy to think deeply about code.
2. I can do other things while coding. Well, right now, Cursor is cycling on a simple task while I type this. Most days, though, I'm responding to customers, working on documentation/planning, or doing some other non-coding task that's critical to my workflows. This is actually where I find my biggest productivity gains. Instead of coding THEN X, I can now do coding WITH X.
Context shifting while trying to code seems like a bad idea to me
Maybe you're some incredible multi-tasking genius able to change tasks rapidly without losing any of the details, but I suspect that if most people tried this workflow they would produce worse code, and whatever their other output is would be lower quality too.
The article brushed aside devs being terrible at estimates, but I dunno.
I'm a frontend guy, been using Claude Code for a couple of weeks now. It's been able to speed up some boilerplate, it's sped up a lot of "naming is hard" conversations I like to have (but my coworkers probably don't, lol), it's enabled me to do a lot more stuff in my most recent project.
But for a task or two I suspect that it has slowed me down. If I'm unable to articulate the problem well enough and the problem is hard enough, you can go in circles for a while. And I think the feeling that "the right answer is just around the corner" makes it hard to timebox, or to find a specific point where you say "yup, time to ditch this and do it the old-fashioned way". There is a bit of a slot-machine effect here.
> But for a task or two I suspect that it has slowed me down
Likely more, as it takes longer to activate your brain when your first thought is to ask an LLM rather than to solve it yourself. It's like people reaching for a calculator to do 4+5; that doesn't make you faster or more accurate.
LLMs make me 10-20x more productive in frontend work which I barely do.
But when it comes to low-level stuff (C/C++) I personally don't find it too useful; it just replaces my need to search Stack Overflow.
edit: I should have mentioned that the low-level stuff I work on is mature code and a lot of the time novel.
This is good if the front end is something you just need to get through. It's terrible if your work is moving toward involving a lot of frontend; you'll never pick up the skills yourself.
As the fullstacker with a roughly 65/35 split BE/FE on the team who has to review this kinda stuff on the daily, there's nothing I dread more than a backender writing FE tickets and vice versa.
Just last week I had to review some monstrosity of a FE ticket written by one of our backenders, with the comment "it's 90% there, should be good to take over". I had to throw out pretty much everything and rewrite it from scratch. My solution was about 150 lines modified, whereas the monstrous output of the AI was non-functional, ugly, a performance nightmare and around 800 lines, with extremely unhelpful and generic commit messages to the tune of "Made things great!!1!1!!".
I can't even really blame them, the C-level craze and zeal for the AI shit is such that if you're not doing crap like this you get scrutinized and PIP'd.
At least frontenders usually have some humility and will tell you they have no clue if it's a good solution or not, while BEnders are always for some reason extremely dismissive of FE work (as can be seen in this very thread). It's truly baffling to me
Interesting, I find the exact opposite. Although to a much lesser extent (maybe 50% boost).
I ended up shoehorned into backend dev in Ruby/Py/Java and don't find it improves my day-to-day a lot.
Specifically in C, it can bang out complicated but mostly common data structures without fault, where I would surely make off-by-one errors. I guess since I do C as a hobby I tend to solve more interesting and complicated problems, like generating a whole array of dynamic C dispatchers from a UI-library spec in JSON that allows parsing and rendering a UI specified in YAML. Gemini Pro even spat out a YAML-dialect parser after a few attempts/fixes.
Maybe it's a function of familiarity and problems you end using the AI for.
This is exactly my experience as well. I've had agents write a bit of backend code, always small parts. I'm lucky enough to be experienced enough with code I didn't write to be able to quickly debug it when it fails (and it always fails from the first run). Like using AI to write a report, it's good for outlines, but the details are always seemingly random as far as quality.
For frontend though? The stuff I really don't specialize in (despite some of my first HTML being written in FrontPage back in 1997), it's a lifesaver. You just have to be careful with prompts, since so many frontend frameworks are basically backend code at this point.
I've been hacking on some somewhat systemsy rust code, and I've used LLMs from a while back (early co-pilot about a year ago) on a bunch of C++ systems code.
In both of these cases, I found that just the smart auto-complete is a massive time-saver. In fact, it's more valuable to me than the interactive or agentic features.
Here's a snippet of some code that's in one of my recent buffers:
// The instruction should be skipped if all of its named
// outputs have been coalesced away.
if ! self.should_keep_instr(instr) {
return;
}
// Non-dropped should have a choice.
let instr_choice =
choices.maybe_instr_choice(instr_ref)
.expect("No choice for instruction");
self.pick_map.set_instr_choice(
instr_ref,
instr_choice.clone(),
);
// Incref all named def inputs to the PIR choice.
instr_choice.visit_input_defs(|input_def| {
self.def_incref(input_def);
});
// Decref all named def inputs to the SIR instr.
instr.visit_inputs(
|input_def| self.def_decref(input_def, sir_graph)
);
The only parts _I_ actually wrote were the comments. The savings from not having to type out the syntax are pretty big; about 80% of the time in manual coding would have been that: little typos, little adjustments to get the formatting right.
The other nice benefit is that I don't have to trust the LLM. I can evaluate each snippet right there and typically the machine does a good job of picking out syntactic style and semantics from the rest of the codebase and file and applying it to the completion.
The snippet, if it's not obvious, is from a bit of compiler backend code I'm working on. I would never have even _attempted_ to write a compiler backend in my spare time without this assistance.
For experienced devs, autocomplete is good enough for massive efficiency gains in dev speed.
I still haven't warmed to the agentic interfaces because I inherently don't trust the LLMs to produce correct code reliably, so I always end up reviewing it, and reviewing greenfield code is often more work than just writing it (esp now that autocomplete is so much more useful at making that writing faster).
It works with low-level C/C++ just fine as long as you rigorously include all relevant definitions in the context window, provide non-obvious context (like the lifecycle of some various objects) and keep your prompts focused.
Things like "apply this known algorithm to that project-specific data structure" work really well and save plenty of time. Things that require a gut feeling for how things are organized in memory don't work unless you are willing to babysit the model.
This feels like a parallel to the Gell-Mann amnesia effect.
Recently, my company has been investigating AI tools for coding. I know this sounds very late to the game, but we're a DoD consultancy, and one not traditionally associated with software development. So, most of the people in the company are very impressed with the AI's output.
I, on the other hand, am a fairly recent addition to the company. I was specifically hired to be a "wildcard" in their usual operations. Which is to say, maybe 10 of us in a company of 3,000 know what we're doing regarding software (and that's being generous, because I don't really have visibility into half of the company). So, that means 99.7% of the company doesn't have the experience necessary to tell what good software development looks like.
The stuff the people using the AI are putting out is... better than what the MilOps analysts pressed into writing Python-scripts-with-delusions-of-grandeur were doing before, but by no means what I'd call quality software. I have pretty deep experience in both back end and front end. It's a step above "code written by smart people completely inexperienced in writing software that has to be maintained over a lifetime", but many steps below, "software that can successfully be maintained over a lifetime".
Well, that's what you'd expect from an LLM. They're not designed to give you the best solution. They're designed to give you the most likely solution. Which means that the results would be expected to be average, as "above average" solutions are unlikely by definition.
You can skew the probability distribution a bit with careful prompting (LLMs that are told to claim to be math PhDs are better at math problems, for instance), but in the end all of those weights in the model are spent encoding the most probable outputs.
So, it will be interesting to see how this plays out. If the average person using AI is able to produce above average code, then we could end up in a virtuous cycle where AI continuously improves with human help. On the other hand, if this just allows more low quality code to be written then the opposite happens and AI becomes more and more useless.
Before the industrial revolution a cabinetmaker would spend a significant amount of time advancing from apprentice to journeyman to master using only hand tools. Now master cabinetmakers that only use hand tools are exceedingly rare, most furniture is made with power tools and a related but largely different skillset.
When it comes to software the entire reason maintainability is a goal is because writing and improving software is incredibly time consuming and requires a lot of skill. It requires so much skill and time that during my decades in industry I rarely found code I would consider quality. Furthermore the output from AI tools currently may have various drawbacks, but this technology is going to keep improving year over year for the foreseeable future.
As a front-of-the-frontend guy, I think it's terrible with CSS and SVG and just okay with HTML.
I work at a shop where we do all custom frontend work and it's just not up to the task. And, while it has chipped in on some accessibility features for me, I wouldn't trust it to do that unsupervised. Even semantic HTML is a mixed bag: if you point out something is a figure/figcaption it'll probably do it right, but I haven't found that it'll intuit these things and get it right on the first try.
But I'd imagine if you don't care about the frontend looking original or even good, and you stick really closely to something like tailwind, it could output something good enough.
And critically, I think a lot of times the hardest part of frontend work is starting, getting that first iteration out. LLMs are good for that. Actually got me over the hump on a little personal page I made a month or so ago and it was a massive help. Put out something that looked terrible but gave me what I needed to move forward.
It's astonishing. A bit scary actually. Can easily see the role of front-end slowly morphing into a single person team managing a set of AI tools. More of an architecture role.
They averaged producing 47% more code on the AI tasks, but took only 20% more time. The report glosses over these considerations, but I'm left wondering: was the extra code superfluous, or did it produce better structure and manage debt better? If that extra 47% of code translates into lower debt and more consistent throughput over the long term, I might take it, given how crushed projects get by debt. Anyway, it's all conjecture, because there are massive statistical differences in the outcomes but no measures of what they mean, though I'm sure they have meaning. That meaning matters a ton.
> They averaged producing 47% more code on the AI tasks, but took only 20% more time. The report glosses over these considerations, but I'm left wondering: was the extra code superfluous, or did it produce better structure and manage debt better? If that extra 47% of code translates into lower debt and more consistent throughput over the long term, I might take it, given how crushed projects get by debt.
Wouldn't it be the opposite? I'd expect the code would be 47% longer because it's worse and heavier in tech debt (e.g. code repeated in multiple places instead of being factored out into a function).
Honestly my experience from using AI to code (primarily claude sonnet) is that that "extra 47%" is probably itself mostly tech debt. Places where the AI repeated itself instead of using a loop. Places where the AI wrote tests that don't actually test anything. Places where the AI failed to produce a simple abstraction and instead just kept doing the same thing by hand. Etc.
AI isn't very good at being concise, in my experience. To the point of producing worse code. Which is a strange change from humans who might just have a habit of being too concise, but not by the same degree.
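A toy example of the kind of repetition I mean (the field names are invented for illustration):

    # The generated version tends to spell out every case by hand...
    def validate_generated(form):
        if not form.get("name"):
            raise ValueError("name is required")
        if not form.get("email"):
            raise ValueError("email is required")
        if not form.get("phone"):
            raise ValueError("phone is required")

    # ...where a loop over the required fields says the same thing with less debt.
    def validate_concise(form, required=("name", "email", "phone")):
        for field in required:
            if not form.get(field):
                raise ValueError(f"{field} is required")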
Your response implies the AI-produced code was landed without review. That's a possible outcome, but I would hope it's unlikely to account for the whole group at this scale. We're of course still lacking data.
Can we have a linter for both high verbosity/repetitiveness and high terseness? I know copy-paste detectors and cognitive-complexity linters are related. I recently generated code that interleaved spreadsheet worksheets (multiple of them) and cell-formatting boilerplate with the code querying the data. I asked the AI to put the boilerplate into another class and expose .write_balance_row(), and it did it perfectly. If a tool reported this, huge changes wouldn't have to reach human reviewers, and AIs could iterate until they pass the linter.
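For the repetitiveness half, even something naive could catch a lot. A rough sketch (window size and normalization are arbitrary choices of mine, not what any existing linter does):

    import sys
    from collections import defaultdict

    WINDOW = 6  # flag any 6-line block that appears more than once

    def find_duplicate_blocks(path):
        # crude normalization: ignore indentation and trailing whitespace
        lines = [line.strip() for line in open(path, encoding="utf-8")]
        seen = defaultdict(list)
        for i in range(len(lines) - WINDOW + 1):
            block = "\n".join(lines[i:i + WINDOW])
            if block.strip():                  # skip all-blank windows
                seen[block].append(i + 1)      # record 1-based start line
        for block, starts in seen.items():
            if len(starts) > 1:
                print(f"duplicate {WINDOW}-line block starting at lines {starts}")

    if __name__ == "__main__":
        find_duplicate_blocks(sys.argv[1])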
All source code is technical debt. If you increase the amount of code, you increase the amount of debt. It's impossible to reduce debt with more code. The only way to reduce debt is by reducing code.
(and note that I'm not measuring code in bytes here; switching to single-character variable names would not reduce debt. I'm measuring it in statements, expressions, instructions; reducing those without reducing functionality decreases debt)
I'll try a counterargument. If more code is more technical debt then writing more succinct code is less technical debt. But succinct code is often harder to grok and maintain than code written for the average Joe dev. So less code can sometimes mean less maintainability and thus more technical debt.
I think you instead meant to say more business logic implemented in code is more technical debt, not necessarily just more code.
Now do a study that specifically gauges how useful an LLM (including smart tab completion) is for a frontend dev working in react/next/tailwind on everyday Jira tickets.
These were maintainers of large open-source projects. It's all relative. It's clearly providing massive gains for some and not as much for others. It should follow that its benefit to you depends on who you are and what you are working on.
It's a very well controlled study about... what the study claims to do. Yes, they didn't study a different thing, for _many_ reasons. Yes, we shouldn't haphazardly extrapolate to other parts of Engineering. But it looks like it's a good study nonetheless.
There are some very good findings though, like how the devs thought they were sped up but they were actually slowed down.
As a backend dev who owns a few internal crappy frontends, LLMs have been the best thing ever. Code quality isn't the top priority, I just need to plumb some data to an internal page at BigCorp.
React and Tailwind already made a lot of tradeoffs to make things more ergonomic for developers. One would expect that LLMs could unlock a leaner and faster stack instead.
Perhaps it is difficult to measure personal productivity in programming, but we can certainly measure that we run more slowly with 10 kg in our backpack. I propose this procedure: the SWE selects 10 tasks and guesses some measure of their complexity (the time to finish them), then randomly selects 5 to be done with AI and the rest without. He performs them and finally calculates a deviation D = D_0 - D_1, where D_i = sum(real_time/guessed_time - 1); D_0 is computed over the tasks done with AI and D_1 over the tasks done without. The sign of D measures whether the use of AI is beneficial or detrimental, and its magnitude measures the size of the impact. Clipping individual addends to the interval [-0.5, 0.5] should keep one bad guess from dominating the estimate.
Sorry if this is a trivial idea, but it is feasible, and intuitively it should provide useful information if the tasks are chosen among those for which the initial guesses have small deviation. A filter should also be applied to exclude tasks in which scaffolding by AI exceeds a certain relative threshold, in case we want to generalize the results to tasks where scaffolding does not dominate the time.
It could also happen that the impact of using AI depends on the task at hand, the capability of the SWE at pair programming with it, and the LLM used, to such an extent that those factors are bigger than the average effect over a bag of tasks; in that case, the large deviation from the mean would make any single-parameter estimate devoid of useful information.
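A minimal sketch of the proposed metric in Python, with made-up (guessed, real) task times:

    def clip(x, lo=-0.5, hi=0.5):
        return max(lo, min(hi, x))

    def deviation(tasks):
        """Sum of clipped relative overruns, real/guessed - 1, over the tasks."""
        return sum(clip(real / guessed - 1) for guessed, real in tasks)

    # Hypothetical (guessed_hours, real_hours) pairs for each arm.
    with_ai    = [(2, 2.5), (1, 1.5), (3, 2.5), (4, 6.0), (2, 2.0)]
    without_ai = [(2, 2.2), (1, 1.2), (3, 3.5), (4, 4.5), (2, 2.5)]

    D = deviation(with_ai) - deviation(without_ai)   # D = D_0 - D_1
    print(f"D = {D:+.2f}")  # negative suggests AI helped; positive, that it hurt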
What if this is true? And then we as a developer community are focused on the wrong thing to increase productivity?
Like, what if by focusing on LLMs for productivity we just reinforce old bad habits and get stuck in a local maximum... And even worse, what if being stuck with current so-so patterns, languages, etc. means we don't innovate in language design, tooling, or other areas that might actually be productivity wins?
Imagine having interstate highways built in one night: you wake up and you have all these highways and roads, and everyone is confused about what they are and how to use them. Using LLMs is the opposite of boiling frogs, because you're not the leader writing, you're just suggesting... I just realized I might not know what I'm talking about.
We were stuck near local maxima since before LLMs came on the scene. I figure the same concentration of innovators are gonna innovate, now LLM-assisted, and the same concentration of best-practice folks are gonna best-practice, now LLM-assisted. Local maxima might get stickier, but greener pastures will be found more quickly than ever.
Honestly the biggest hindrance of developer productivity right now is probably perpetual, looming layoffs, not lacking AI, tools, programming languages, etc :)
AI could make me more productive, I know that for a fact. But, I don't want to be more productive because the tasks that could be automated with AI are those I find enjoyable. Not always in an intellectual sense, but in a meditative sense. And if I automated those away, I think I would become less human.
I have never found a measure of programmer productivity that makes sense to me, but I can say that LLM coding tools are way more distracting to me than they are worth. They constantly guess at what I may type next, are often wrong, and pop in with suggestions breaking my mental flow and making me switch from the mindset of coding to the mindset of reviewing code.
The more I used it, the easier it became to skip over things I should have thought through myself. But looking back, the results weren’t always faster or better.
Now I prefer to treat AI as a kind of challenger. It helps reveal the parts I haven't truly understood, rather than just speeding things up.
I’ve been around tech for a long time. At this point, I’ve lost count of how many hype cycles I’ve seen hit the “hold on, everything sucks” stage. Generative AI is seemingly at the “hold on, everything sucks” stage and it’s getting repetitive.
> To compute the actual speedup – or, rather, slowdown! – provided by AI tools, the researchers compared the developers’ predictions of how long each task would take to the measured completion time.
I'm sorry, but it feels to me like this research has only proven that developers tend to underestimate how long a task will take, with or without AI.
As far as I can tell, they didn't actually measure how much faster a specific task is when performed with AI versus without.
I find LLMs are decent at regurgitating boilerplate. Basically the same kind of stuff you could google then copy-paste... AI chatbots, now that they have web access, are also good at going over documentation and saving you a little time searching through the docs yourself.
They're not great at business logic though, especially if you're doing anything remotely novel. Which is the difficult part of programming anyway.
But yeah, to the average corporate programmer who needs to recreate the same internal business tool that every other company has anyway, it probably saves a lot of time.
They're great at helping me figure out how to make something work with a poorly-documented, buggy framework, which is indeed a large fraction of my job, whether I like it or not.
This isn't true, and I know it from what I'm working on; sorry, I'm not at liberty to give more details. But I see how untrue this is every working hour of every day.
I finally took the plunge and did a big chunk of work in Cursor. It was pretty ideal: greenfield but with a very relevant example to slightly modify (the example pulled events over HTTP as a server and I wanted it to pull events over Google pub/sub instead).
Over IDK, 2-3 hours I got something that seemed on its face to work, but:
- it didn't use the pub/sub API correctly
- the 1 low-coverage test it generated didn't even compile (Go)
- there were a bunch of small errors it got confused by--particularly around closures
I got it to "90%" (again though it didn't at all work) with the first prompt, and then over something like a dozen more mostly got it to fix its own errors. But:
- I didn't know the pub/sub API--I was relying on Cursor to do this correctly--and it totally submarined me
- I had to do all the digging to get the test to compile
- I had to go line by line and tell it to rewrite... almost everything
I quit when I realized I was spending more time prompting it to fix things than it would take me to fully engage my brain and fix them myself. I also noticed that there was a strong pull to "just do one more prompt" rather than dig in and actually understand things. That's super problematic to me.
Worse, this wasn't actually faster. How do I know that? The next day I did what I normally do: read docs and wrote it myself. I spent less time (I'm a fast typist and a Vim user) overall, and my code works. My experience matches pretty well w/ the results of TFA.
---
Something I will say though is there is a lot of garbage stuff in tech. Like, I don't want to learn Terraform (again) just to figure out how to deploy things to production w/o paying a Heroku-like premium. Maybe I don't want to look up recursive CTEs again, or C function pointers, or spend 2 weeks researching a heisenbug I put into code for some silly reason AI would have caught immediately. I am _confident_ we can solve these things without boiling oceans to get AI to do it for us.
But all this shit about how "I'm 20x more productive" is totally absurd. The only evidence we have of this is people just saying it. I don't think a 20x productivity increase is even imaginable. Overall productivity since 1950 is up 3.6x [0]. These people are asking us to believe they've achieved over 400 years of productivity gains in "3 months". Extraordinary claims require extraordinary evidence. My guess is either you were extremely unproductive before, or (like others are saying in the threads) in very small ways you're 20x more productive but most things are unaffected or even slower.
You're using it wrong -- it's intended to be a conversational experience. There are so many techniques you can utilize to improve the output while retaining the mental model of codebase.
Can you say more than literally "you're using it wrong"? Otherwise this is a no true scotsman (super common when LLM advocates are touting their newfound productivity). Here are my prompts, lightly redacted:
First prompt:
```
Build a new package at <path>.
Use the <blah> package at <path> as an example.
The new package should work like the <blah> package, but instead of receiving
events over HTTP, it should receive events as JSON over a Google Pub/Sub topic.
This is what one such event would look like:
{
/* some JSON */
}
```
My assumptions when I gave it the following prompt were wrong, but it didn't correct me (it actually does sometimes, so this isn't an unreasonable expectation):
```
The <method> method will only process a single message from the subscription. Modify it to continuously process any messages received from the subscription.
```
These next 2 didn't work:
```
The context object has no method WithCancel. Simply use the ctx argument to the
method above.
```
```
There's no need to attach this to the <object> object; there's also no need for this field. Remove them.
```
At this point, I fix it myself and move on.
```
There's no need to use a waitgroup in <method>, or to have that field on <object>.
Modify <method> to not use a waitgroup.
```
```
There's no need to run the logic in <object> inside an anonymous function on a
goroutine. Remove that; we only need the code inside the for loop.
```
```
Using the <package> package at <path> as an example, add metrics and logging
```
This didn't work for esoteric reasons:
```
On line 122 you're casting ctx to <context>, but that's already its type from this method's parameters. Remove this case and the error handling for when it fails.
```
...but this fixed it though:
```
Assume that ctx here is just like the ctx from <package>, for example it already has a logger.
```
There were some really basic errors in the test code. I thought I would just ask it to fix them:
```
Fix the errors in the test code.
```
That made things worse, so I just told it exactly what I wanted:
```
<field1> and <field2> are integers, just use integers
```
I wouldn't call it a "conversation" per se, but this is essentially what I see Kenton Varda, Simon Willison, et al doing.
This entire concept hinges on AI not getting better. If you believe AI is going to continue to get better in the current ~5-10% a month range, then hand-waving over developer productivity today is about the same thing as writing an article about the internet being a fad in 1999.
If they do improve at 5-10% a month then that'd definitely be true (tbh I'm not sure they are even improving at that rate now - 10% for a year would be 3x improvement with compounding).
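The compounding arithmetic is easy to sanity-check (quick Python, nothing more than the raw numbers):

```
# Monthly improvement rate compounded over 12 months.
for monthly in (0.05, 0.10):
    yearly = (1 + monthly) ** 12
    print(f"{monthly:.0%} per month -> {yearly:.2f}x per year")
# 5% per month  -> 1.80x per year
# 10% per month -> 3.14x per year
```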
I guess the tricky bit is, nobody knows what the future looks like. "The internet is a fad" in 1999 hasn't aged well, but a lot of people also touted 1960s AI, XML, and 3D televisions as the tools everyone would be using within a few years.
I obviously don't mean that people literally write "write me some typescript", because nobody wants code that does something arbitrary. I'm also not saying that every reaction to ai falls between love and skeptical: I wrote a 3 sentence comment on a complex topic to sketch out an idea.
The tone of your comment suggests that my comment upset you, which wasn't my intent. But you have to try to be a little generous when you read other peoples stuff, or these discussion will get very tedious quickly.
I use it very routinely to generate TikZ diagrams. The output is often obviously wrong and I need to tweak it a little by hand. But the hardest part is often getting something working at all, and at this the AI is first class. It gets me 90% of the way there, and the rest is me.
I think you touched on an important aspect, but did not explore it further.
If we accept that AI is a tool, then the problem is the nature of that tool, as it will vary heavily from individual to individual. This partially accounts for the ridiculous differences between the self-reported accounts of people who use it on a regular basis.
And then, there is a possibility that my questions are not that unusual and/or are well documented (quite possible), so my perception of the usefulness of those answers is skewed.
My recent interaction with o4 was pretty decent on a very new (by industry standards) development, and while documentation for it exists, it is a swirling vortex of insanity from where I sit. I was actually amazed to see how easily 4o spotted some of those discrepancies and listed them to me, along with the likely pitfalls that may come with them. We will be able to find out if that prediction holds very soon.
What I am saying is that it has its uses.
The thing about tools is that they need to be predictable. I can't remember the source, but it's a concept I read that really stuck with me. A predictable tool can be used skillfully and accurately because the user can anticipate how it works and deploy it effectively. It will always be aligned with the user intent because the user decides how and when it is used.
A tool that constantly adapts to how it is used will frequently be misaligned with user intent. Language models constantly change their own behavior based on the specific phrasing you gave it, the context you deployed it in, and the inherent randomness in token generation. Its capacity to be used as a tool will be inherently limited by this unpredictability.
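The sampling part is the only piece of that you can partially turn down; the phrasing and context sensitivity remain no matter what. As a rough sketch (assuming the OpenAI Python SDK; the seed parameter is best-effort reproducibility, not a guarantee):

```
# Reduce (not eliminate) run-to-run variation by pinning the sampling knobs.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # model choice is just an example
    messages=[{"role": "user", "content": "Suggest a name for a function that parses ISO dates."}],
    temperature=0,   # greedy-ish decoding, less sampling noise
    seed=1234,       # best-effort determinism across runs
)
print(response.choices[0].message.content)
```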
Well, I use it before Google, since it generally summarizes webpages and removes the ads. Quite handy. It’s also very useful for checking whether you understand something correctly. And for programming specifically I found it really useful for help naming things (which tends to be hard, not least because it’s subjective).
> You ask very specific questions where not a lot of documentation is available and inevetably even o3 ends up being pretty useless.
You have any example questions where o3 failed to be helpful?
I use it pretty similarly to you, basically only resorting to it to unblock myself; otherwise I'm mostly the one doing the actual work, and LLMs help with specific functions, specific blockers, or exploring new "spaces". But almost every time I've gotten stuck, o3 (and o3-pro mode) managed to get me unstuck, once I'd figured out the right way to ask the question, even when my own searching and reading didn't help.
I had to create a Cython module wrapping some C, used Claude 4 and GPT 4.1, they were worse than useless. One can imagine why I needed help with that project.
I am personally somewhere in-between these two places. I've used ChatGPT to get unstuck a few times this past week because I was at the end of my rope with regards to some GPU crashes that I couldn't make heads or tails of. I then used it for less headache-inducing things and overall it's been an interesting experience.
For research I'm enjoying asking ChatGPT to annotate its responses with sources and reading those; in some cases I've found SIGGRAPH papers that I wouldn't have stumbled upon otherwise, and it's nice to get them all in a response.
ChatGPT (4o, if it's of any interest) is very knowledgeable about DirectX12 (which we switched to just this week) and I've gained tons of peripheral knowledge with regards to the things I've been battling with, but only one out of four times has it been able to actually diagnose directly what the issue was; three separate times it's been something it didn't really bring up or note in any meaningful regard. What helped was really just me writing about it, thinking about everything around it and for that it's been very helpful.
Realistically, if someone let an agent running on this stuff loose on our code base it would likely end up wasting days of time and still not fix the issue. Even worse, the results would have to be tested on a specific GPU to even trigger the issue to begin with.
It seems to me that fancy auto-complete is likely the best this would be able to do still, and I actually like it for that. I don't use LLM-assisted auto-complete anymore, but I used to use GitHub Copilot back in 2022 and it was more productive than my brief tests of agents.
If I were to regularly use LLMs for actual programming it would most likely be just for tab-completion of "rest of expressions" or one line at a time, but probably with local LLMs.
It's kind of true. I only use it for simple stuff that I don't have time for. For example, how to write a simple diagram in TikZ. The AI does the simple busywork of providing a good-enough approximation, which I can tweak to get what I want.
For hard questions, I prefer to use my own skills, because the AI often regurgitates what I'm already aware of. I still ask it on the off-chance it comes up with something cool, but most often I have to do it myself.
I find that in the latter case it's at least a serviceable rubber duck.
What bothers me more than any of this particular discussion is that we have seemed incapable of determining programmer productivity in a meaningful way since my debut as a programmer 40 years ago.
I’m confused as to why anyone would think this would be possible to determine.
Like can we determine the productivity of doctors, lawyers, journalists, or pastry chefs?
What job out there is so simple that we can meaningfully measure all the positive and negative effects of the worker, as well as account for different conditions between workers?
I could probably get behind the idea that you could measure productivity for professional poker players (given a long enough evaluation period). Hard to think of much else.
People in charge love to measure productivity and, just as harmfully, performance. The main insight people running large organisations (big business and governments) have into how they are doing is metrics, so they will use what measures they can have regardless of how meaningful they are.
The British government (probably not any worse than anyone else, just what I am most familiar with) does measure the productivity of the NHS: https://www.england.nhs.uk/long-read/nhs-productivity/ (including doctors, obviously).
They also try to measure the performance of teachers and schools, and introduced performance league tables and special exams (SATS - exams sat at various ages by school children in the state system, nothing like the American exams with the same name) to do this more pervasively. They made it better by creating multi-academy trusts, which add a layer of management running multiple schools, so even more people want even more metrics.
The same for police, and pretty much everything else.
Yet paradoxically, the user knows instinctively. I know exactly when I'll get my next medical checkup, and when the test results will arrive. I know if a software app improves my work, and what it will cost to get a paid license.
The hard thing is occupations where the quantity of effort is unrelated to the result due to the vast number of confounding factors.
We can determine the productivity of factory workers, and that is still(!) how we are seen by some managers.
And to be fair, some CRUD work is repetitive enough that it should be possible to get a fair measure of at least the difference in speed between developers.
But the fact that building simple CRUD services with REST interfaces takes as much time as it does is a failure of the tools we use.
Won't stop MBAs from trying though.
Duly upvoted! I tend to agree. Yet the shibboleth of productivity haunts us still.
> Like can we determine the productivity of doctors, lawyers, journalists, or pastry chefs?
Yes, yes we can.
Programmers really need to stop this cope about us being such special snowflakes that we can't be assessed, and that our managers just need to take it on good faith that we're worth keeping around.
Part of that may be what we measure “product” to be.
My entire life, I have written “ship” software. It’s been pretty easy to say what my “product” is.
But I have also worked at a fairly small scale, in very small teams (often, only me). I was paid to manage a team, but it was a fairly small team, with highly measurable output. Personally, I have been writing software as free, open-source stuff, and it was easy to measure.
Some time ago, someone posted a story about how most software engineers have hardly ever actually shipped anything. I can’t even imagine that. I would find that incredibly depressing.
It would also make productivity pretty hard to measure. If I spent six months working on something that never made it out of the crèche, would that mean all my work was for nothing?
Also, really experienced engineers write a lot less code (that does a lot more). They may spend four hours, writing a highly efficient 20-line method, while a less-experienced engineer might write a passable 100-line method in a couple of hours. The experienced engineers’ work might be “one and done,” never needing revision, while the less-experienced engineer’s work is a slow bug farm (loaded with million-dollar security vulnerability tech debt), which means that the productivity is actually deferred, for the more experienced engineer. Their manager may like the less-experienced engineer's work, because they make a lot more noise, doing it, are "faster," and give MOAR LINES. The "down-the-road" tech debt is of no concern to the manager.
I worked for a company that held the engineer accountable even if the issue appeared two years after shipping. It encouraged engineers to do their homework, and each team had a dedicated testing section to ensure that they didn't ship bugs.
When I ask ChatGPT (for example) for a code solution, I find that it’s usually quite “naive” (pretty prolix). I usually end up rewriting it. That doesn’t mean that’s a bad thing, though. It gives me a useful “starting point,” and can save me several hours of experimenting.
> When I ask ChatGPT (for example) for a code solution, I find that it’s usually quite “naive” (pretty prolix). I usually end up rewriting it. That doesn’t mean that’s a bad thing, though. It gives me a useful “starting point,” and can save me several hours of experimenting.
The usual counter-point is that if you (commonly) write code by experimenting, you are doing it wrong. Better think the problem through, and then write decent code (that you finally turn into great code). If the code that you start with is as "naive" as you describe, in my experience it is nearly always better to throw it away (you can't make gold out of shit) and completely start over, i.e. think the problem through and then write decent code.
But nevertheless, productivity objectively exists. Some people/teams are more productive than others.
I suppose it would be simpler to compare productivity for people working on standard, "normalized" tasks, but often every other task a programmer is assigned is something different to the previous one, and different developers get different tasks.
It's difficult to measure productivity based on real-world work, but we can create an artificial experiment: give N programmers the same M "normal", everyday tasks and observe whether those using AI tools complete them more quickly.
This is somewhat similar to athletic competitions — artificial in nature, yet widely accepted as a way to compare runners’ performance.
We can determine productivity for the purpose of studies like this. Give a bunch of developers the exact same task and measure how quickly they can produce a defect-free solution. Unfortunately, this study didn’t do that – the developers chose their own tasks.
Is there any AI that can create a defect-free solution to non-trivial programming problems without supervision? This has never worked in any of my tests, so I suspect the answer is currently No.
what about the $ you make? isn't that an indicator? you've probably made more than me, so you are more successful while both of us might be doing the same thing.
I don't think there's much of a correlation there.
Salary is an indirect and partially useful metric, but one could argue that your ability to self-promote matters more, at least in the USA. I worked at Microsoft and saw that some of the people who made fat stacks of cash just happened to be at the right place at the right time, or promoted things that looked good but were not good for the company itself.
I made great money running my own businesses, but the vast majority of the programming was by people I hired. I’m a decent talent, but that gave me the ability to hire better ones than me.
What about the $ you generate? I'm a software developer consultant. We charge by the hour: up front, time and materials, and/or support hours. It doesn't take many leaps of logic to see there is a downside to completing a task too quickly or too well.
I have to bill my clients, and I have documented around 3 weeks of development time saved by using LLMs to port other client systems to our system since December. On one hand this means we should probably update our cost estimates, but I'm not management, so for the time being I've decided to use the saved time to overdeliver on quality.
Eventually clients might get wise, not want us to overdeliver on quality, and expect us to charge less according to the time saved by LLMs. Despite a measured increase in "productivity" I would be generating less $ because my overall billable-hour % decreases.
Hopefully overdelivering now reduces tech debt (and thus overhead) and introduces new features that grow our client pipeline, offsetting the eventual shift in what we charge our clients. That's about all the agency I can find in this situation.
It is from a certain point of view. For example, at a national level productivity is measured in GDP per hour worked. Even this is problematic - it means you can increase measured productivity by reducing working hours or by making low-paid workers unemployed.
On the other hand it makes no sense from some points of view. For example, if you get a pay rise, that does not mean you are more productive.
In a vacuum I don’t believe pay alone is a very good indicator. What might be a better one is if someone has a history across their career of delivering working products to spec, doing this across companies and with increasing responsibility. This of course can only be determined after the fact.
Probably not. I took a new job at significantly reduced pay because it makes me feel better and reduces stress. The fact that I can allow myself to work for less seems to me like I'm more successful.
Productivity has zero to do with salary. Case in point: FOSS.
Some of the most productive devs don't get paid by the big corps who make use of their open source projects, hence the constant urging of corps and people to sponsor projects they make money via.
People doing charity work, work for non-profits or work for public benefit corporations typically have vastly lower wages than those who work in e.g high frequency trading or other capital-adjacent industries. Are you comfortable declaring that the former is always vastly less productive than the latter?
Changing jobs typically brings a higher salary than your previous job. Are you saying that I'm significantly more productive right after changing jobs than right before?
I recently moved from being employed by a company to do software development, to running my own software development company and doing consulting work for others. I can now put in significantly fewer hours, doing the same kind of work (sometimes even on the same projects that I worked on before), and make more money. Am I now significantly more productive? I don't feel more productive, I just learned to charge more for my time.
IMO, your suggestion falls on its own ridiculousness.
Another metric could be time. Do people work fewer hours?
Is DB2 Admin more productive than Java dev on the same seniority?
What about countries? Here in Poland $25k would be an amazing salary for a senior, while in the USA fresh grads can earn $80k. Are they more productive?
... at the same time, given the same seniority, job and location - I'd be willing to say it wouldn't be a bad heuristic.
Actually, we can’t quantify most of the things we would like to optimize.
Team members always know who is productive and who isn’t, but generally don’t snitch to the management because it will be used against them or cause conflicts with colleagues. This team-level productivity doesn’t necessarily translate into something positive for a company.
Management is forced to rely on various metrics which are gamed or inaccurate.
I've been using Claude Code heavily for about 3 months now, and I'm pretty sure I'm between 10 and 20 times more productive while using it.
How I measure performance is how many features I can implement in a given period of time.
It's nice that people have done studies and have opinions, but for me, it's 10x to 20x better.
I find the swings to be wild: when you win with it, you win really big, but when you lose with it, it's a real bite out of your week too. And I think 10x to 20x has to be figurative, right? You can do 20x by volume maybe, but to borrow an expression from Steve Ballmer, that's like measuring an airplane by kilograms.
Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious? Even a machine that just handed you the perfect code can't 20x your real output, even if it gave you the source file at 20x your native sophistication you wouldn't be able to build and deploy it, let alone make changes to it.
But even if it's only the last 5-20% when you're already operating at your very limit and trying to hit that limit every single day, that is massive; it makes a bunch of stuff on the bubble go from "not realistic" to "we did that".
There are definitely swings. Last night it took about 2 hours to get Monaco into my webpack-built Bootstrap template; it came down to CSS being mishandled, and Claude couldn't see the light. I just pasted the code into ChatGPT o3 and it fixed it on the first try. I pasted the output of ChatGPT into Claude and voilà, all done.
A key skill is to sense when the AI is starting to guess for solutions (no different to human devs) and then either lean into another AI or reset context and start over.
I'm finding the code quality increases greatly with the addition of the text 'and please follow best practices because will be pen tested on this!' and wow... it takes it much more seriously.
Let's be serious, what percentage of devs are doing "high complexity, high cognitive load, detail intense" work?
> Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious?
How much of the code you write is actually like this? I work in the domain of data modeling; for me, once the math is worked out, the majority of the code is "trivial". The kind of code you are talking about is maybe 20% of my time. Honestly, also the most enjoyable 20%. I would be very happy if that were all I worked on, with the rest of it done by AI.
> Someone already operating at the very limit of their abilities doing stuff that is for them high complexity, high cognitive load, detail intense, and tactically non-obvious?
When you zoom in, even this kind of work isn't uniform - a lot of it is still shaving yaks, boring chores, and tasks that are hard dependencies for the work that is truly cognitively demanding, but themselves are easy(ish) annoyances. It's those subtasks - and the extra burden of mentally keeping track of them - that sets the limit of what even the most skilled, productive engineer can do. Offloading some of that to AI lets one free some mental capacity for work that actually benefits from that.
> Even a machine that just handed you the perfect code can't 20x your real output, even if it gave you the source file at 20x your native sophistication you wouldn't be able to build and deploy it, let alone make changes to it.
Not true if you use it right.
You're probably following the "grug developer" philosophy, as it's popular these days (as well as "but think of the juniors!", which is the perceived ideal in the current zeitgeist). By design, this turns coding into boring, low-cognitive-load work. Reviewing such code is, thus, easier (and less demoralizing) than writing it.
20x is probably a bit much across the board, but for the technical part, I can believe it - there's too much unavoidable but trivial bullshit involved in software these days (build scripts, Dockerfiles, IaaS). Preventing deep context switching on those is a big time saver.
I agree I feel more productive. AI tools do actually make things easier and make my brain use less energy. You would think that would mean more productivity, but maybe it just feels that way.
Stage magicians say that the magic is done in the audience's memory after the trick is done. It's the effect of the activity.
AI coding tools make developers happier and able to spend more brain power on actually difficult things. But overall perhaps the amount of work isn't greater by orders of magnitude; it just feels like it.
Waze, the navigation app, routes you along non-standard routes so that you are not stuck in traffic, so it feels like you are making fast progress. But the time taken may be longer and the distance travelled may be further!
Being in stuck traffic and not moving even for a little bit makes you feel that time has stopped, it's boring and frustrating. Now developers need never be stuck. Their roads will be clear, but they may take longer routes.
We get little boosts of dopamine using AI tools to do stuff. Perhaps we use these signals as indicators of productivity: "Ahh, that day's work felt good, I did a lot."
> Waze, the navigation app, routes you along non-standard routes so that you are not stuck in traffic, so it feels like you are making fast progress. But the time taken may be longer and the distance travelled may be further!
You're not "stuck in traffic", you are the traffic. If the app distributes users around and this makes it so they don't end up in traffic jams, it's effectively preventing traffic jams from forming
I liked your washing machine vs. sink example that I see you just edited out. The machine may do it slower and less efficiently than you'd do it in the sink, but the machine runs in parallel, freeing you to do something else. So it is with good use of LLMs.
> on actually difficult things
Can't help but note that in 99% of cases this "difficult things" trope makes little sense. In most jobs, the freed-up time is either spent on other stupid tasks, or lost to org inefficiencies, or just procrastinated away.
> AI coding tools makes developers happier and able to spend more brain power on actually difficult things
Please don't speak for all developers when you say stuff like this
AI coding tools make me miserable to use
Where I have found Claude most helpful is on problems with very specific knowledge requirements.
Like: Why isn’t this working? Here, Claude, read this roughly 90-page PDF and tell me where I went wrong interfacing with this SDK.
Ohh, I accidentally passed async_context_background_threading_safe instead of async_context_thread_safe_poll and so now it’s panicking. Wow, that would have taken me forever.
I can’t believe such numbers. If this were true, why don’t you quit your job and vibe code 10 iOS apps?
I wish I could. Some problems are difficult to solve and I still need to pay the bills.
So I work 8 hours a day (to get money to eat) and code another 4 hours at home at night.
Weekends are both 10 hour days, and then rinse / repeat.
Unfortunately some projects are just hard to do and until now, they were too hard to attempt to solve solo. But with AI assistance, I am literally moving mountains.
The project may still be a failure but at least it will fail faster, no different to the pre-AI days.
You're getting 6 months worth of work done in a week?
I made a bet with a co-worker that a migration from Angular 15 to Angular 19 could be done really fast, avoiding months of work. I spent a whole evening on it, and Claude Code was never able to pull off even the migration from 15 to 16 on its own. A total waste of time; nothing worked. To my surprise, it cost me $275 for nothing. So maybe for greenfield projects it’s smooth and saves time, but it’s not a silver bullet on projects with problems.
For the sake of argument 20x means you have basically suddenly got access to 19 people with the same skill set as you.
You can build a new product company with 20 people. Probably in the same domain as you are in right now.
Output doesn't necessarily scale linearly as you add more people. Look up The Mythical Man-Month.
I cringe when I see these numbers. 20 times better means that you can accomplish in two months what you would otherwise do in 4 years, which is ridiculous when said out loud. We can make it even more ridiculous by pointing out that you would do in 3 years the work of a working lifetime (60 years).
I am wondering, on what sort of tasks are you seeing this 20x boost?
It is amazing, cringe all you want :)
I scoped out a body of work and even with the AI assisting on building cards and feature documentation, it came to about 2 to 4 weeks to implement.
It was done in 2 days.
The key I've found with working as fast as possible is to have planning sessions with Claude Code and make it challenge you and ask tons of questions. Then get it to break the work into 'cards' (think Jira, but they are just .md files in your repo) and then maintain a todo.md and done.md file pair that sorts and organizes work flow.
Then start a new context, tell it to review todo.md and pick up next task, and burn through it, when done, commit and update todo.md and done.md, /compact and you're off on the next.
It's more than AI hinting at what to do, it's a whole new way of working with rigor and structure around it. Then you just focus fire on the next card, and the next, and if you ever think up new features, then card it up and put it in the work queue.
You are extrapolating over years as if a programmer’s task list is consistent.
Claude code has made bootstrapping a new project, searching for API docs, troubleshooting, summarizing code, finding a GitHub project, building unit tests, refactoring, etc easily 20x faster.
It’s the context switching that is EXTREMELY expensive for a person, but costless for the LLM. I can focus on strategy (planning features) instead of being bogged down in lots of tactics (code warnings, syntax errors).
Claude Code is amazing, but the 20x gains aren’t evenly distributed. There are some projects that are too specialized (obscure languages, repos larger than the LLM’s context window, concepts that aren’t directly applicable to any codebase in their training corpus, etc). But for those of us using common languages and commodity projects, it’s a massive force multiplier.
I built my second iOS app (Swift) in about 3 days x 8 hours of vibe coding. A vocab practice app with adjustable learning profile, 3 different testing mechanisms, gamification (awards, badges), iOS notifications, text to speech, etc. My first iOS app was smaller, mostly a fork of another app, and took me 4 weeks of long days. 20x speed up with Claude Code is realistic.
And it saves even more time when researching + planning which features to add.
> in two months what you would do in 4 years
If those numbers were true, there should be a FOSS project explosion by now. Commercial products too.
Maybe writing made up HN comments?
It isn’t ridiculous, it’s easily true, especially when you’re experienced in general, but have little to no knowledge of this particular big piece of tech, like say you’ve stopped doing frontend when jquery was all there was and you’re coming back. I’m doing things with react in hours I would have no business doing in weeks a couple years ago.
> How I measure performance is how many features I can implement in a given period of time.
When a measure becomes a target, it ceases to be a good measure.
> I'm pretty sure
So were the people taking part in the study. Which is why we do these: to understand where our understanding of ourselves is lacking.
Maybe you are special and do get extra gains. Or maybe you are as wrong about yourself as everyone else and are overestimating the gains you think you have.
Have any open source work you can show off?
Not the OP, but:
https://repo.autonoma.ca/notanexus.git
I don't know the PDF.js library. Writing both the client- and server-side for a PDF annotation editor would have taken 60 hours, maybe more. Instead, a combination of Copilot, DeepSeek, Claude, and Gemini yielded a working prototype in under 6 hours:
https://repo.autonoma.ca/notanexus.git/tree/HEAD/src/js
I wrote maybe 3 lines of JavaScript, the rest was all prompted.
Unfortunately not, but ensuring the final code is well written is a challenge I am putting off for now.
I'm leaning into the future growth of AI capabilities to help me here, otherwise I'll have to do it myself.
That is a tomorrow problem, too much project structure/functionality to get right first.
I'm between 73 and 86 times more productive using claude code. You're not using it well.
Those are rookie numbers, you gotta pump those numbers up.
Can you show some of those problems and their solutions?
Same, I’ve done stuff that should have taken me 2-3 weeks in days
I’ve done this without AI. The thing was not as hard as I thought it would be.
You're only getting 10x to 20x more productive? For me it's more like 10,000x to 50,000x, at minimum. YMMV.
I have exactly the same experience.
As others have probably experienced, I can only add that I am now doing coding I would have kicked down the road if I did not have LLM assistance.
Example: using LeafletJS — not hard, but I didn't want to have to search all over to figure out how to use it.
Example: other web page development requiring dropping image files, complicated scrolling, split-views, etc.
In short, there are projects I have put off in the past but eagerly begin now that LLMs are there to guide me. It's difficult to compare times and productivity in cases like that.
This is pretty similar to my own experience using LLMs as a tool.
When I'm working with platforms/languages/frameworks I am already deeply familiar with I don't think they save me much time at all. When I've tried to use them in this context they seem to save me a bunch of time in some situations, but also cost me a bunch of time in others resulting in basically a wash as far as time saved goes.
And for me a wash isn't worth the long-term cost of losing touch with the code by not being the one to have crafted it.
But when it comes to environments I'm not intimately familiar with they can provide a very easy on-ramp that is a much more pleasant experience than trying to figure things out through often iffy technical documentation or code samples.
> search all over to figure out how to use it.
The Leaflet doc is a single-page document with examples you can copy-paste. There is page navigation at the top. Also, ctrl/cmd+F and a keyword seems quicker than writing the prompt.
Nice. I'm afraid I simply assumed, like other "frameworks", it was going to entail wandering all over StackOverflow, etc.
Still, when I simply told Claude that I wanted the pins to group together when zoomed out — it immediately knew I meant "clustering" and added the proper import to the top of the HTML file ... got it done.
I think this for me is the most worrying: "You can see that for AI Allowed tasks, developers spent less time researching and writing code".
My analogy to this is seeing people spend time trying to figure out how to change colors, draw shapes in powerpoint, rather than focus on the content and presentation. So here, we have developers now focusing their efforts on correcting the AI output, rather than doing the research and improving their ability to deliver code in the future.
Hmm...
I find I’m most likely to use an LLM to generate code in certain specific scenarios: (i) times I’m suffering from “writer’s block” or “having trouble getting started”; (ii) a language or framework I don’t normally use; (iii) feeling tired/burnt out/demotivated
When I’m in the “zone” I wouldn’t go near an LLM, but when I’ve fallen out of the “zone” they can be useful tools in getting me back into it, or just finishing that one extra thing before signing off for the day
I think the right answer to “does LLM use help or hinder developer productivity” is “it depends on how you use them”
It can get over some mental blocks, having some code to look at can start the idea process even it’s wrong (just like for writing). I don’t think it’s bad, like I don’t think writing throw away code for prototyping is a bad way to start a project that you aren’t sure how to tackle. Waterfall (lots of research and design up front) is still not going to work even if you forgo AI.
This has been my observation too. It's a tool for the lazy.
Us lazies need tools too!
You can say the same about a printer. Or a kindle, oh you're too lazy to carry around 5 books with you?
laziness is a driving force of progress
Here is the the methodology of the study:
> To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years. Developers provide lists of real issues (246 total) that would be valuable to the repository—bug fixes, features, and refactors that would normally be part of their regular work. Then, we randomly assign each issue to either allow or disallow use of AI while working on the issue. When AI is allowed, developers can use any tools they choose (primarily Cursor Pro with Claude 3.5/3.7 Sonnet—frontier models at the time of the study); when disallowed, they work without generative AI assistance. Developers complete these tasks (which average two hours each) while recording their screens, then self-report the total implementation time they needed. We pay developers $150/hr as compensation for their participation in the study.
So it's a small sample size of 16 developers. And it sounds like different tasks were (randomly) assigned to the no-AI and with-AI groups - so the control group doesn't have the same tasks as the experimental group. I think this could lead to some pretty noisy data.
Interestingly - small sample size isn't in the list of objections that the author includes under "Addressing Every Objection You Thought Of, And Some You Didn’t".
I do think it's an interesting study. But would want to see if the results could be reproduced before reading into it too much.
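For intuition, the comparison being made is roughly of this shape (a toy sketch, not the paper's actual estimator; the minutes are invented), and it also shows why heterogeneous tasks make the estimate noisy - a couple of unusually large issues in either arm move the ratio a lot:

```
# Toy between-condition comparison on randomly assigned issues.
# Real analyses control for developer and task effects; this is only the shape of it.
import math
import statistics

ai_times    = [95, 240, 60, 180, 130]   # minutes, issues where AI was allowed
no_ai_times = [80, 200, 75, 150, 110]   # minutes, issues where AI was disallowed

def geometric_mean(xs):
    return math.exp(statistics.fmean([math.log(x) for x in xs]))

ratio = geometric_mean(ai_times) / geometric_mean(no_ai_times)
print(f"AI / no-AI time ratio: {ratio:.2f}  (values above 1 mean slower with AI)")
```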
I think the productivity gains most people rave about are stuff like, I wanted to do X which isn't hard if you are experienced with library Y and library Y is pretty popular and the LLM did it perfectly first try!
I think that's where you get 10-20x. When you're working on niche stuff it's either not gonna work or work poorly.
For example right now I need to figure out why an ffmpeg filter doesn't do X thing smoothly, even though the C code is tiny for the filter and it's self contained.. Gemini refuses to add comments to the code. It just apologizes for not being able to add comments to 150 lines of code lol.
However for building an ffmpeg pipeline in python I was dumbfounded how fast I was prototyping stuff and building fairly complex filter chains which if I had to do by hand just by reading the docs it would've taken me a whole lot more time, effort and frustration but was a joy to figure out with Gemini.
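For concreteness, this kind of filter chain can be put together in a few lines. A sketch using the ffmpeg-python bindings, which is one way to do it from Python; the file names and filter values are made up:

```
# Rough sketch of a small filter chain via the ffmpeg-python bindings
# (pip install ffmpeg-python); input/output names and values are invented.
import ffmpeg

video = (
    ffmpeg
    .input("in.mp4")
    .filter("scale", 1280, 720)       # resize
    .filter("eq", brightness=0.05)    # slight brightness tweak
)
audio = ffmpeg.input("in.mp4").audio.filter("volume", 0.8)

ffmpeg.output(video, audio, "out.mp4").overwrite_output().run()
```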
So going back to the study, IMO it's flawed because by definition working on new features for open source projects wouldn't be the bread and butter of LLMs however most people aren't working on stuff like this, they're rewriting the same code that 10000 other people have written but with their own tiny little twist or whatever.
I really think they excel at greenfield work, and are “fine” at writing code for existing systems. When you are unfamiliar with a library or a pattern it’s a huge time saver.
The sample size isn't 16 developers, it's 246 issues.
So I agree with that - but on the other hand surely the number of developers matters here? For example, if instead of 16 developers the study had consisted of a single developer completing all 246 tasks with or without AI, and comparing the observed times to complete them, I think most people would question the reproducibility and relevance of the study.
Whilst my recent experience possibly agrees with the findings, I came here to moan about the methods. Whether it's 16 or 246, that's still a miserably small sample size.
Okay, so why not 246,000 issues?
One thing I find frustrating with these conversations is the _strict_ focus on single-task productivity.
Arguably, on a single coding task, I don't really move that much faster. However, I have much, much more brain capacity left both while coding and when I'm done coding.
This has two knock on effects:
1. Most simply, I'm productive for longer. Since LLMs are doing a lot of the heavy lifting, my brain doesn't have to constantly think. This is especially important in time periods where I'd previously have too little mental energy to think deeply about code.
2. I can do other things while coding. Well, right now, Cursor is cycling on a simple task while I type this. Most days, though, I'm responding to customers, working on documentation/planning, or doing some other non-coding task that's critical to my workflows. This is actually where I find my biggest productivity gains. Instead of coding THEN X, I can now do coding WITH X.
> I can do other things while coding
Context shifting while trying to code seems like a bad idea to me
Maybe you're some incredible multi -tasking genius able to change tasks rapidly without losing any of the details, but I suspect if most people tried this workflow they would produce worse code and also whatever their other output is would be low quality too
The article brushed aside devs being terrible at estimates, but I dunno.
I'm a frontend guy, been using Claude Code for a couple of weeks now. It's been able to speed up some boilerplate, it's sped up a lot of "naming is hard" conversations I like to have (but my coworkers probably don't, lol), it's enabled me to do a lot more stuff in my most recent project.
But for a task or two I suspect that it has slowed me down. If I'm unable to articulate the problem well enough and the problem is hard enough, you can go in circles for a while. And I think the nature of "the right answer is just around the corner" makes it hard to timebox or find a specific point where you say "yup, time to ditch this and do it the old-fashioned way". There is a bit of a slot-machine effect here.
> But for a task or two I suspect that it has slowed me down
Likely more, as it takes longer for you to activate your brain when your first thought is to ask an LLM rather than solve it yourself. It's like people reaching for a calculator to do 4+5; that doesn't make you faster or more accurate.
Depends on if the problem is 4+5 or something much more complex.
LLMs make me 10-20x more productive in frontend work which I barely do. But when it comes to low-level stuff (C/C++) I personally don't find it too useful. it just replaces my need to search stackoverflow.
edit: should have mentioned the low-level stuff I work on is mature code and a lot of times novel.
This is good if front end is something you just need to get through. It's terrible if your work is moving to involve a lot of frontend - you'll never pick up the skills yourself
As the fullstacker with a roughly 65/35 split BE/FE on the team who has to review this kinda stuff on the daily, there's nothing I dread more than a backender writing FE tickets and vice versa.
Just last week I had to review some monstrosity of a FE ticket written by one of our backenders, with the comment of "it's 90% there, should be good to takeover". I had to throw out pretty much everything and rewrite it from scratch. My solution was like 150 lines modified, whereas the monstrous output of the AI was non-functional, ugly, a performance nightmare and around 800 lines, with extremely unhelpful and generic commit messages to the tune of "Made things great!!1!1!!".
I can't even really blame them, the C-level craze and zeal for the AI shit is such that if you're not doing crap like this you get scrutinized and PIP'd.
At least frontenders usually have some humility and will tell you they have no clue if it's a good solution or not, while BEnders are always for some reason extremely dismissive of FE work (as can be seen in this very thread). It's truly baffling to me
Interesting, I find the exact opposite. Although to a much lesser extent (maybe 50% boost).
I ended up shoehorned into backend dev in Ruby/Py/Java and don't find it improves my day-to-day a lot.
Specifically in C, it can bang out complicated but mostly common data-structures without fault where I would surely do one-off errors. I guess since I do C for hobby I tend to solve more interesting and complicated problems like generating a whole array of dynamic C-dispatchers from a UI-library spec in JSON that allows parsing and rendering a UI specified in YAML. Gemini pro even spat out a YAML-dialect parser after a few attempts/fixes.
Maybe it's a function of familiarity and problems you end using the AI for.
As in, it seems to be best at problems that you’re unfamiliar with in domains where you have trouble judging the quality?
This is exactly my experience as well. I've had agents write a bit of backend code, always small parts. I'm lucky enough to be experienced enough with code I didn't write to be able to quickly debug it when it fails (and it always fails from the first run). Like using AI to write a report, it's good for outlines, but the details are always seemingly random as far as quality.
For frontend though? The stuff I really don't specialize in (despite some of my first html beginning on FrontPage 1997 back in 1997), it's a lifesaver. Just gotta be careful with prompts since so many front end frameworks are basically backend code at this point.
I've been hacking on some somewhat systemsy rust code, and I've used LLMs from a while back (early co-pilot about a year ago) on a bunch of C++ systems code.
In both of these cases, I found that just the smart auto-complete is a massive time-saver. In fact, it's more valuable to me than the interactive or agentic features.
Here's a snippet of some code that's in one of my recent buffers:
The actual code _I_ wrote was the comments. The savings from not having to type out the syntax are pretty big. About 80% of the time in manual coding would have been that: little typos, little adjustments to get the formatting right.
The other nice benefit is that I don't have to trust the LLM. I can evaluate each snippet right there and typically the machine does a good job of picking out syntactic style and semantics from the rest of the codebase and file and applying it to the completion.
The snippet, if it's not obvious, is from a bit of compiler backend code I'm working on. I would never have even _attempted_ to write a compiler backend in my spare time without this assistance.
For experienced devs, autocomplete is good enough for massive efficiency gains in dev speed.
I still haven't warmed to the agentic interfaces because I inherently don't trust the LLMs to produce correct code reliably, so I always end up reviewing it, and reviewing greenfield code is often more work than just writing it (esp now that autocomplete is so much more useful at making that writing faster).
What exact tool are you using for your smart auto-complete?
1 reply →
It works with low-level C/C++ just fine as long as you rigorously include all relevant definitions in the context window, provide non-obvious context (like the lifecycle of various objects) and keep your prompts focused.
Things like "apply this known algorithm to that project-specific data structure" work really well and save plenty of time. Things that require a gut feeling for how things are organized in memory don't work unless you are willing to babysit the model.
This feels like a parallel to the Gell-Mann amnesia effect.
Recently, my company has been investigating AI tools for coding. I know this sounds very late to the game, but we're a DoD consultancy, and one not traditionally associated with software development. So most of the people in the company are very impressed with the AI's output.
I, on the other hand, am a fairly recent addition to the company. I was specifically hired to be a "wildcard" in their usual operations. Which is to say, maybe 10 of us in a company of 3000 know what we're doing regarding software (but that's being generous, because I don't really have visibility into half of the company). So that means 99.7% of the company doesn't have the experience necessary to tell what good software development looks like.
The stuff the people using the AI are putting out is... better than what the MilOps analysts pressed into writing Python-scripts-with-delusions-of-grandeur were doing before, but by no means what I'd call quality software. I have pretty deep experience in both back end and front end. It's a step above "code written by smart people completely inexperienced in writing software that has to be maintained over a lifetime", but many steps below, "software that can successfully be maintained over a lifetime".
Well, that's what you'd expect from an LLM. They're not designed to give you the best solution. They're designed to give you the most likely solution. Which means that the results would be expected to be average, as "above average" solutions are unlikely by definition.
You can skew the probability distribution a bit with careful prompting (LLMs told to claim they are math PhDs are better at math problems, for instance), but in the end all of those weights in the model are spent encoding the most probable outputs.
So, it will be interesting to see how this plays out. If the average person using AI is able to produce above average code, then we could end up in a virtuous cycle where AI continuously improves with human help. On the other hand, if this just allows more low quality code to be written then the opposite happens and AI becomes more and more useless.
Before the industrial revolution a cabinetmaker would spend a significant amount of time advancing from apprentice to journeyman to master using only hand tools. Now master cabinetmakers that only use hand tools are exceedingly rare, most furniture is made with power tools and a related but largely different skillset.
When it comes to software the entire reason maintainability is a goal is because writing and improving software is incredibly time consuming and requires a lot of skill. It requires so much skill and time that during my decades in industry I rarely found code I would consider quality. Furthermore the output from AI tools currently may have various drawbacks, but this technology is going to keep improving year over year for the foreseeable future.
Same. It’s amazing for frontend.
As a front-of-the-frontend guy, I think it's terrible with CSS and SVG and just okay with HTML.
I work at a shop where we do all custom frontend work and it's just not up to the task. And, while it has chipped in on some accessibility features for me, I wouldn't trust it to do that unsupervised. Even semantic HTML is a mixed bag: if you point out something is a figure/figcaption it'll probably do it right, but I haven't found that it'll intuit these things and get it right on the first try.
But I'd imagine if you don't care about the frontend looking original or even good, and you stick really closely to something like tailwind, it could output something good enough.
And critically, I think a lot of times the hardest part of frontend work is starting, getting that first iteration out. LLMs are good for that. Actually got me over the hump on a little personal page I made a month or so ago and it was a massive help. Put out something that looked terrible but gave me what I needed to move forward.
It's astonishing. A bit scary actually. Can easily see the role of front-end slowly morphing into a single person team managing a set of AI tools. More of an architecture role.
Is this because they had the entire web to train on, code + output and semantics in every page?
They averaged producing 47% more code on the AI tasks, but took only 20% more time. The report here glosses over these considerations, but I'm left wondering: was the extra code superfluous, or did it produce better structure / manage debt better? If that extra 47% of code translates to lower debt and more consistent throughput over the long term, I might take it, given how crushed projects get by debt. Anyway, it's all hyperbole, because there are massive statistical differences in the outcomes but no measure of what they mean, though I'm sure they do mean something. That meaning matters a ton.
> They averaged producing 47% more code on the AI tasks, but took only 20% more time. The report here glosses over these considerations, but I'm left wondering: was the extra code superfluous, or did it produce better structure / manage debt better? If that extra 47% of code translates to lower debt and more consistent throughput over the long term, I might take it, given how crushed projects get by debt.
Wouldn't it be the opposite? I'd expect the code would be 47% longer because it's worse and heavier in tech debt (e.g. code repeated in multiple places instead of being factored out into a function).
Honestly my experience from using AI to code (primarily claude sonnet) is that that "extra 47%" is probably itself mostly tech debt. Places where the AI repeated itself instead of using a loop. Places where the AI wrote tests that don't actually test anything. Places where the AI failed to produce a simple abstraction and instead just kept doing the same thing by hand. Etc.
AI isn't very good at being concise, in my experience. To the point of producing worse code. Which is a strange change from humans who might just have a habit of being too concise, but not by the same degree.
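As a contrived illustration of the "repeated itself instead of using a loop" failure mode (the types and values are invented):

```
package main

import "fmt"

type Order struct{ Price int }

func main() {
	orders := []Order{{10}, {20}, {30}}

	// What the model tends to emit: the same statement repeated per element.
	total := orders[0].Price
	total += orders[1].Price
	total += orders[2].Price

	// What a reviewer would want: a loop that doesn't grow with the data.
	loopTotal := 0
	for _, o := range orders {
		loopTotal += o.Price
	}

	fmt.Println(total, loopTotal) // 60 60
}
```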
Your response implies the AI-produced code was landed without review. That's a possible outcome, but I would hope it's unlikely to account for the whole group at this scale. We're of course still lacking data.
Can we have a linter for both high verbosity/repetitiveness and high terseness? I know copy-paste detectors and cognitive-complexity linters are related. I recently generated code that interleaved spreadsheet worksheets (multiple of them) and cell-formatting boilerplate with the code that queries the data. I asked the AI to put the boilerplate into another class and expose .write_balance_row(), and it did it perfectly. If a tool reported this kind of thing, huge changes wouldn't have to reach human reviewers, and AIs could iterate until they pass the linter.
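A toy sketch of what the repetitiveness half of such a linter could start from: a naive line-level duplication ratio. Real copy-paste detectors work on token streams, and the threshold and sample code below are made up.

```
package main

import (
	"fmt"
	"strings"
)

// duplicationRatio reports the fraction of non-blank lines that are
// exact repeats of an earlier line, after trimming whitespace.
func duplicationRatio(src string) float64 {
	seen := map[string]bool{}
	total, dupes := 0, 0
	for _, line := range strings.Split(src, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		total++
		if seen[line] {
			dupes++
		}
		seen[line] = true
	}
	if total == 0 {
		return 0
	}
	return float64(dupes) / float64(total)
}

func main() {
	code := `
sheet.SetCellStyle("A1", boldStyle)
sheet.SetCellStyle("A1", boldStyle)
sheet.SetCellStyle("A1", boldStyle)
sheet.SetCellStyle("A2", boldStyle)
writeBalanceRow(sheet, balances)
`
	ratio := duplicationRatio(code)
	fmt.Printf("duplication ratio: %.2f\n", ratio)
	if ratio > 0.25 { // arbitrary threshold; a real linter would be token-based
		fmt.Println("too repetitive: send it back to the model before humans see it")
	}
}
```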
I have an extremist take on this:
All source code is technical debt. If you increase the amount of code, you increase the amount of debt. It's impossible to reduce debt with more code. The only way to reduce debt is by reducing code.
(and note that I'm not measuring code in bytes here; switching to single-character variable names would not reduce debt. I'm measuring it in statements, expressions, instructions; reducing those without reducing functionality decreases debt)
I'll try a counterargument. If more code is more technical debt then writing more succinct code is less technical debt. But succinct code is often harder to grok and maintain than code written for the average Joe dev. So less code can sometimes mean less maintainability and thus more technical debt.
I think you instead meant to say more business logic implemented in code is more technical debt, not necessarily just more code.
Now do a study that specifically gauges how useful an LLM (including smart tab completion) is for a frontend dev working in react/next/tailwind on everyday Jira tickets.
These were maintainers of large open source projects. It's all relative. It's clearly providing massive gains for some and not as much for others. It should follow that its benefit to you depends on who you are and what you are working on.
It isn't black and white.
It's a very well controlled study about... what the study claims to do. Yes, they didn't study a different thing, for _many_ reasons. Yes, we shouldn't haphazardly extrapolate to other parts of Engineering. But it looks like it's a good study nonetheless.
There are some very good findings though, like how the devs thought they were sped up but they were actually slowed down.
As a backend dev who owns a few internal crappy frontends, LLMs have been the best thing ever. Code quality isn't the top priority, I just need to plumb some data to an internal page at BigCorp.
Could you share more about your process and how they specifically help you with your internal frontends? Any details would be great! Thanks!
React and Tailwind already made a lot of tradeoffs to make development more ergonomic. One would expect that LLMs could unlock a leaner, faster stack instead.
Perhaps it is difficult to measure personal productivity in programming, but we can certainly measure that we run more slowly with 10 kg in our backpack. I propose this procedure: the SWE selects 10 tasks and guesses some measure of their complexity (the time to finish them), then randomly picks 5 to be done with AI and the rest without. He performs them and finally calculates a deviation D = D_0 - D_1, where D_i = sum(real_time / guessed_time - 1); D_0 is the sum over the tasks done with AI and D_1 over those done without. The sign and magnitude of D then measure, respectively, whether using AI is beneficial or detrimental and how big the impact is. Clipping individual addends to the interval [-0.5, 0.5] should keep one bad guess from dominating the estimate. Sorry if this is a trivial idea, but it is feasible, and it should intuitively provide useful information if the tasks are chosen among those for which the initial guesses have small deviation. A filter should also exclude tasks in which AI scaffolding exceeds a certain relative share of the time, in case we want to generalize the results to tasks where scaffolding does not dominate.
It could happen that the impact of using AI depends on the task at hand, on the SWE's ability to pair-program with it, and on the LLM used, to such an extent that those factors are bigger than the average effect over a bag of tasks; in that case the large deviation from the mean would make any single-parameter estimate void of useful information.
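A tiny sketch of the proposed deviation metric, with the clipping applied per addend; the task times below are invented, not data from any study:

```
package main

import "fmt"

type task struct {
	guessed, real float64 // hours
	withAI        bool
}

// clip bounds a single addend so one bad guess can't dominate D.
func clip(x float64) float64 {
	if x > 0.5 {
		return 0.5
	}
	if x < -0.5 {
		return -0.5
	}
	return x
}

// deviation computes D = D_0 - D_1, where D_0 sums real/guessed - 1 over
// AI tasks and D_1 over non-AI tasks. D > 0 means the AI tasks overran
// their estimates by more than the non-AI tasks did.
func deviation(tasks []task) float64 {
	var d0, d1 float64
	for _, t := range tasks {
		addend := clip(t.real/t.guessed - 1)
		if t.withAI {
			d0 += addend
		} else {
			d1 += addend
		}
	}
	return d0 - d1
}

func main() {
	tasks := []task{
		{guessed: 2, real: 2.5, withAI: true},
		{guessed: 1, real: 1.3, withAI: true},
		{guessed: 3, real: 2.8, withAI: false},
		{guessed: 2, real: 2.1, withAI: false},
	}
	fmt.Printf("D = %+.2f (positive = AI tasks overran more)\n", deviation(tasks))
}
```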
That's pretty much what the study the article refers to did, and it found that using AI made developers 19% slower.
I was surprised at how much better v0 is these days. I remember it yielding clunky UIs initially.
I thought it was the model, but then I realised, v0 is carried by the shadcn UI library, not the intelligence of the model
What if this is true, and we as a developer community are focused on the wrong thing to increase productivity?
Like, what if by focusing on LLMs for productivity we just reinforce old bad habits and get stuck in a local maximum... And even worse, what if being stuck with current so-so patterns, languages, etc. means we don't innovate in language design, tooling, or other areas that might actually be productivity wins?
Imagine having interstate highways built in one night: you wake up and you have all these highways and roads, and everyone is confused about what they are and how to use them. Using LLMs is the opposite of boiling frogs, because you're not the one doing the writing, you're just suggesting... I just realized I might not know what I'm talking about.
We were stuck near local maxima since before LLMs came on the scene. I figure the same concentration of innovators are gonna innovate, now LLM-assisted, and the same concentration of best-practice folk are gonna best-practice, now LLM-assisted. Local maxima might get stickier, but greener pastures will be found more quickly than ever.
I expect it'll balance.
Honestly the biggest hindrance of developer productivity right now is probably perpetual, looming layoffs, not lacking AI, tools, programming languages, etc :)
AI could make me more productive, I know that for a fact. But, I don't want to be more productive because the tasks that could be automated with AI are those I find enjoyable. Not always in an intellectual sense, but in a meditative sense. And if I automated those away, I think I would become less human.
I have never found a measure of programmer productivity that makes sense to me, but I can say that LLM coding tools are way more distracting to me than they are worth. They constantly guess at what I may type next, are often wrong, and pop in with suggestions breaking my mental flow and making me switch from the mindset of coding to the mindset of reviewing code.
The more I used it, the easier it became to skip over things I should have thought through myself. But looking back, the results weren’t always faster or better. Now I prefer to treat AI as a kind of challenger. It helps reveal the parts I haven't truly understood, rather than just speeding things up.
I’ve been around tech for a long time. At this point, I’ve lost count of how many hype cycles I’ve seen hit the “hold on, everything sucks” stage. Generative AI is seemingly at the hold on, everything sucks stage and it’s getting repetitive.
Trough of Disillusionment (followed by the Slope of Enlightenment and Plateau of Productivity): https://en.wikipedia.org/wiki/Gartner_hype_cycle
My bold prediction is that the Trough of Disillusionment for LLMs is going to be a very long stretch
I found that early and often code reviews can offset the reduction in productivity. A good code review process can fix this.
I have never felt so disconnected from the findings of a study
> To compute the actual speedup – or, rather, slowdown! – provided by AI tools, the researchers compared the developers’ predictions of how long each task would take to the measured completion time.
I'm sorry, but it feels to me like this research has only proven that developers tend to underestimate how long a task is supposed to take, with or without AI.
In no way did they actually measure how much faster a specific task was when performed with and without AI?
What I understand they did is:
You have two tasks:
You ask the dev to estimate both.
Then you randomly tell the dev, ok do Task 1 without AI, and Task 2 with AI.
Then you measure the actual time it took.
Their estimate for the AI tasks missed the mark by 19%, but those without AI were done 20% faster than estimated.
At the time of estimating they didn't know if the task would need to be done with AI or not.
History repeats itself: "horses are more efficient than cars." Besides, is a study based on 16 devs representative enough to draw this conclusion?
I find LLMs are decent at regurgitating boilerplate. Basically the same kind of stuff you could google then copy-paste... AI chatbots, now that they have web access, are also good at going over documentation and save you a little time searching through the docs yourself.
They're not great at business logic though, especially if you're doing anything remotely novel. Which is the difficult part of programming anyway.
But yeah, to the average corporate programmer who needs to recreate the same internal business tool that every other company has anyway, it probably saves a lot of time.
They're great at helping me figure out how to make something work with a poorly-documented, buggy framework, which is indeed a large fraction of my job, whether I like it or not.
This isn't true, and I know it from what I'm working on; sorry, I'm not at liberty to give more details. But I see how untrue this is every working hour of every day.
You say more details as if you gave any to begin with...
I finally took the plunge and did a big chunk of work in Cursor. It was pretty ideal: greenfield but with a very relevant example to slightly modify (the example pulled events over HTTP as a server and I wanted it to pull events over Google pub/sub instead).
Over IDK, 2-3 hours I got something that seemed on its face to work, but:
- it didn't use the pub/sub API correctly
- the 1 low-coverage test it generated didn't even compile (Go)
- there were a bunch of small errors it got confused by--particularly around closures
I got it to "90%" (again though it didn't at all work) with the first prompt, and then over something like a dozen more mostly got it to fix its own errors. But:
- I didn't know the pub/sub API--I was relying on Cursor to do this correctly--and it totally submarined me
- I had to do all the digging to get the test to compile
- I had to go line by line and tell it to rewrite... almost everything
I quit when I realized I was spending more time prompting it to fix things than it would take me to fully engage my brain and fix them myself. I also noticed that there was a strong pull to "just do one more prompt" rather than dig in and actually understand things. That's super problematic to me.
Worse, this wasn't actually faster. How do I know that? The next day I did what I normally do: read docs and wrote it myself. I spent less time (I'm a fast typist and a Vim user) overall, and my code works. My experience matches pretty well w/ the results of TFA.
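For reference, continuously pulling JSON events with the official Go Pub/Sub client (cloud.google.com/go/pubsub) looks roughly like the sketch below. The project ID, subscription name, and event type are placeholders, and this is written from the documented API, not from what Cursor produced:

```
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// Event is a placeholder for the JSON payload carried on the topic.
type Event struct {
	ID   string `json:"id"`
	Kind string `json:"kind"`
}

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription")

	// Receive blocks and keeps invoking the callback for each message
	// until ctx is cancelled or an unrecoverable error occurs -- no
	// hand-rolled loop, waitgroup, or goroutine plumbing required.
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		var ev Event
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("bad event: %v", err)
			msg.Nack()
			return
		}
		log.Printf("got event %s (%s)", ev.ID, ev.Kind)
		msg.Ack()
	})
	if err != nil {
		log.Fatalf("Receive: %v", err)
	}
}
```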
---
Something I will say though is there is a lot of garbage stuff in tech. Like, I don't want to learn Terraform (again) just to figure out how to deploy things to production w/o paying a Heroku-like premium. Maybe I don't want to look up recursive CTEs again, or C function pointers, or spend 2 weeks researching a heisenbug I put into code for some silly reason AI would have caught immediately. I am _confident_ we can solve these things without boiling oceans to get AI to do it for us.
But all this shit about how "I'm 20x more productive" is totally absurd. The only evidence we have of this is people just saying it. I don't think a 20x productivity increase is even imaginable. Overall productivity since 1950 is up 3.6x [0]. These people are asking us to believe they've achieved over 400 years of productivity gains in "3 months". Extraordinary claims require extraordinary evidence. My guess is either you were extremely unproductive before, or (like others are saying in the threads) in very small ways you're 20x more productive but most things are unaffected or even slower.
[0]: https://fred.stlouisfed.org/series/OPHNFB
You're using it wrong -- it's intended to be a conversational experience. There are so many techniques you can utilize to improve the output while retaining your mental model of the codebase.
Respectfully, this is user error.
Can you say more than literally "you're using it wrong"? Otherwise this is a no true Scotsman (super common when LLM advocates are touting their newfound productivity). Here are my prompts, lightly redacted:
First prompt:
``` Build a new package at <path>. Use the <blah> package at <path> as an example. The new package should work like the <blah> package, but instead of receiving events over HTTP, it should receive events as JSON over a Google Pub/Sub topic. This is what one such event would look like:
{ /* some JSON */ } ```
My assumptions when I gave it the following prompt were wrong, but it didn't correct me (it actually does sometimes, so this isn't an unreasonable expectation):
``` The <method> method will only process a single message from the subscription. Modify it to continuously process any messages received from the subscription. ```
These next 2 didn't work:
``` The context object has no method WithCancel. Simply use the ctx argument to the method above. ```
``` There's no need to attach this to the <object> object; there's also no need for this field. Remove them. ```
At this point, I fix it myself and move on.
``` There's no need to use a waitgroup in <method>, or to have that field on <object>. Modify <method> to not use a waitgroup. ```
``` There's no need to run the logic in <object> inside an anonymous function on a goroutine. Remove that; we only need the code inside the for loop. ```
``` Using the <package> package at <path> as an example, add metrics and logging ```
This didn't work for esoteric reasons:
``` On line 122 you're casting ctx to <context>, but that's already its type from this method's parameters. Remove this case and the error handling for when it fails. ```
...but this fixed it though:
``` Assume that ctx here is just like the ctx from <package>, for example it already has a logger. ```
There were some really basic errors in the test code. I thought I would just ask it to fix them:
``` Fix the errors in the test code. ```
That made things worse, so I just told it exactly what I wanted:
``` <field1> and <field2> are integers, just use integers ```
I wouldn't call it a "conversation" per se, but this is essentially what I see Kenton Varda, Simon Willison, et al doing.
This entire concept hinges on AI not getting better. If you believe AI is going to continue getting better at the current ~5-10% a month rate, then hand-waving over developer productivity today is about the same as writing an article about the internet being a fad in 1999.
On the flip side, why would I use AI today if it presents no immediate benefit? Why not wait 5 years and see if it becomes actually helpful?
better yet, wait 10, let me know how it goes
If they do improve at 5-10% a month then that'd definitely be true (tbh I'm not sure they are even improving at that rate now - 10% for a year would be 3x improvement with compounding).
I guess the tricky bit is, nobody knows what the future looks like. "The internet is a fad" in 1999 hasn't aged well, but a lot of people touted 1960s AI, XML, and 3D televisions as the tools everyone would be using within a few years.
We're all just guessing till then.