Comment by grey-area
3 days ago
Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:
1. LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).
2. Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
Let me offer a third (not necessarily true) interpretation:
The developer who has experience using Cursor saw a productivity increase not because he became better at using Cursor, but because he became worse at not using it.
Or, one person in 16 has a particular personality, inclined to LLM dependence.
Didn't they rather mean:
Developers' own skills might atrophy when they don't write that much code themselves, relying on AI instead.
So now, comparing with and without AI, they're faster with it. But a year ago they might have been that fast, or faster, without AI.
I'm not saying that's how things are, just pointing out another way to interpret what GP said.
Invoking personality is to the behavioral sciences as invoking God is to the natural sciences. One can explain anything by appealing to personality, and as such it explains nothing. Psychologists have been trying to make sense of personality for over a century without much success (the best effort so far is the five-factor model [Big 5], which ultimately has fairly minor predictive value), which is why most behavioral scientists have learned to simply leave personality to the philosophers and concentrate on much simpler theoretical frameworks.
A much simpler explanation is the one your parent offered. And to many behavioralists it is actually the same explanation, since to a true scotsm... [cough] behavioralist, personality is simply learned habits, so by Occam's razor you should omit personality from your model.
Became worse is possible
Became worse in 50 hours? Super unlikely
> Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
I would argue you don't need the "as a programming assistant" phrase, because in my experience over the past 2 years, literally every single AI tool is massively oversold as to its utility. I've not seen a single one that delivers on what it's billed as capable of.
They're useful, but right now they need a lot of handholding, and I don't have time for that. Too much fact-checking. If I want a tool I always have to double-check, well, I was born with a memory, so I'm already covered there. I don't want to have to fact-check my fact-checker.
LLMs are great at small tasks. The larger the single task is, or the more tasks you try to cram into one session, the worse they fall apart.
> Current LLMs
One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
> One thing that happened here is that they aren't using current LLMs
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
If you interact with internet comments and discussions as an amorphous blob of people, you'll see a constant trickle of the view that models are useful now and were useless before.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person, like steveklabnik above, keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
Sorry, that’s not my take. I didn’t think these tools were useful until the latest set of models, that is, they crossed the threshold of usefulness to me.
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
Everything actually got better. Look at the image generation improvements as an easily visible benchmark.
I do not program for my day job, and I vibe coded two different web projects: one in twenty minutes, as a test with a Cloudflare deployment having never used Cloudflare, and one in a week over vacation (and then I fixed a deep Safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities of sub-average people like me and decrease the time and brain requirements significantly.
I had to make a little update to reset the KV store on Cloudflare, and the LLM did it in 20s after getting the syntax wrong twice. I would've spent at least a few minutes looking it up otherwise.
I've been a proponent for a long time, so I certainly fit this at least partially. However, the combination of Claude Code and the Claude 4 models has pushed the response to my demos of AI coding at my org from "hey, that's kind of cool" to "Wow, can you get me an API key please?"
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
The current batch of models, specifically Claude Sonnet and Opus 4, are the first I've used that have actually been more helpful than annoying on the large mixed-language codebases I work in. I suspect that dividing line differs greatly between developers and applications.
It’s true though? Previous models could only do well in specifically created settings. You can throw practically anything at Opus, and it’ll work mostly fine.
The previous model retroactively becomes not as good as the best available models. I don't think that's a huge surprise.
Maybe it's convenient. But isn't it also just a fact that some of the models available today are better than the ones available five months ago?
>the previous model retroactively becomes total dogshit the moment a new one is released
Keep writing your code manually, nobody cares.
Convenient for whom, and for what...? There is nothing tangible to gain from you believing or not believing that someone else does (or does not) get a productivity boost from AI. This is not a religion and it's not crypto. An AI user's net worth is not tied to anyone else's use of, or stance on, AI (if anything, it's the opposite).
More generally, this phenomenon is quite simply explained and not at all surprising: new things improve, quickly. That does not mean that something is good or valuable, but it's how new tech gets introduced every single time, and it readily explains the changing sentiment.
The third option is that the person who used Cursor before had some sort of skill atrophy that led to lower unassisted speed.
I think an easy measure to help identify why a slowdown is happening would be to check how much refactoring happened on the AI-generated code. Oftentimes it seems to be missing things like error handling, or it adds unnecessary stuff. Of course, this assumes it even had a working solution in the first place.
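One rough way to operationalize that (just a sketch, not anything from the study; the commit hash and file path below are hypothetical): count how many lines an AI-authored commit added, then check how many of those lines still survive in git blame at HEAD. A low survival rate suggests heavy rework.

    # Sketch: estimate how much of an AI-authored commit survived later rework.
    # Assumes you can identify the AI commit; hash and path below are placeholders.
    import subprocess

    def surviving_fraction(ai_commit: str, path: str) -> float:
        # Lines the AI commit added to this file ("added<TAB>deleted<TAB>path").
        numstat = subprocess.run(
            ["git", "show", "--numstat", "--format=", ai_commit, "--", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        added = int(numstat[0]) if numstat else 0
        if added == 0:
            return 1.0  # nothing to measure for this file
        # Lines in the current version of the file still blamed on that commit.
        blame = subprocess.run(
            ["git", "blame", "--line-porcelain", "HEAD", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        surviving = sum(
            1 for line in blame.splitlines() if line.startswith(ai_commit)
        )
        return surviving / added

    print(surviving_fraction("abc1234", "src/server.py"))  # hypothetical inputs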
> people consistently predict and self-report in the wrong direction
I recall an adage about work estimation: as chunks get too big, people unconsciously answer "how possible does the final outcome feel?" in place of "how long will the work take to do?"
People asked "how long did it take" could be substituting something else, such as "how alone did I feel while working on it."
That’s an interesting adage. Any ideas of its source?
It might have been in Kahneman's "Thinking, Fast and Slow"
Or a sampling artifact. 4 vs 12 does seem significant within a study, but consider a set of N such studies.
I assume that many large companies have tested the efficiency gains and losses of their programmers much more extensively than the authors of this tiny study.
A survey of companies and their evaluations and conclusions would carry more weight (excluding companies selling AI products, of course).
If you run an exact binomial test on a 4-vs-12 split with a 50/50 null, P(X<=4) is about 0.038, which means a two-sided p of about 0.077.
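For reference, a minimal sketch of that calculation, assuming the 4-of-16 split discussed above and a 50/50 null (scipy's binomtest does the exact test):

    # Exact binomial test: 4 of 16 developers faster with AI, null = 50/50.
    # The 4/16 split is taken from the thread above, not re-derived here.
    from scipy.stats import binomtest

    one_sided = binomtest(k=4, n=16, p=0.5, alternative="less").pvalue
    two_sided = binomtest(k=4, n=16, p=0.5, alternative="two-sided").pvalue
    print(one_sided)  # ~0.038, i.e. P(X <= 4)
    print(two_sided)  # ~0.077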