Comment by steveklabnik

3 days ago

> Current LLMs

One thing that happened here is that they aren't using current LLMs:

> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.

That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.

> One thing that happened here is that they aren't using current LLMs

I've been hearing this for 2 years now.

The previous model retroactively becomes total dogshit the moment a new one is released.

Convenient, isn't it?

  • If you interact with internet comments and discussions as an amorphous blob of people, you'll see a constant trickle of the view that models are useful now and were useless before.

    If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person like steveklabnik above keeps flip-flopping on their view.

    This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...

  • Sorry, that’s not my take. I didn’t think these tools were useful until the latest set of models, that is, they crossed the threshold of usefulness to me.

    Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.

    • Do you really see a massive jump?

      For context, I've been using AI, a mix of OpenAI + Claude, mainly for bashing out quick React stuff, for over a year now. For anything else it's generally rubbish and slower than working without it. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.

      I'd say they're only marginally better today than they were even 2 years ago.

      Every time a new model comes out you get a bunch of people raving about how great the new one is, and I honestly can't really tell the difference. The only real difference is that reasoning models actually slowed everything down, but now I can see the reasoning. It's only useful because I often spot it leaving important stuff out of the final answer.

  • Everything actually got better. Look at the image generation improvements as an easily visible benchmark.

    I do not program for my day job, and I vibe coded two different web projects: one in twenty minutes as a test with a Cloudflare deployment, having never used Cloudflare, and one in a week over vacation (and then I fixed a deep Safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities of sub-average people like me and decrease the time / brain requirements significantly.

    I had to make a little update to reset the KV store on Cloudflare, and the LLM did it in 20s after getting the syntax wrong twice. I would've spent at least a few minutes looking it up otherwise.

  • I've been a proponent for a long time, so I certainly fit this at least partially. However, the combination of Claude Code and the Claude 4 models has pushed the response to my demos of AI coding at my org from "hey, that's kind of cool" to "Wow, can you get me an API key please?"

    It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.

    • Is that really the case vs. 3.7? For me that was the threshold, and since then the improvements have been nice but not as significant.

  • The current batch of models, specifically Claude Sonnet and Opus 4, are the first I've used that have actually been more helpful than annoying on the large mixed-language codebases I work in. I suspect that dividing line differs greatly between developers and applications.

  • It’s true though? Previous models could do well in specifically created settings. You can throw practically everything at Opus, and it’ll work mostly fine.

  • The previous model retroactively becomes not as good as the best available models. I don't think that's a huge surprise.

    • The surprise is the implication that the crossover between net-negative and net-positive impact happened to be in the last 4 months, in light of the initial release 2 years ago and sufficient public attention for a study to be funded and completed.

      Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.

    • That's not the argument being made, though. The argument is that it does "work" now, implying that it didn't quite work before; except that is the same thing the same people say for every model release, including at the release of the previous one, which is now acknowledged to be seriously flawed, and including the future one, at which time the current models will similarly be acknowledged to be not only less performant than the future models, but inherently flawed.

      Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but that history does mean you should take their comments with a grain of salt.

  • Maybe it's convenient. But isn't it also just a fact that some of the models available today are better than the ones available five months ago?

    • Sure, but after having spent some time trying to get anything useful - programmatically - out of previous models and not getting anything, how much time should one spend once a new one is announced?

      Sure, you may end up missing out on a good thing and having to come late to the party, but showing up early too many times to find the beer watered down and the food full of grubs is apt to make you cynical the next time a party announcement comes your way.

    • That's not the issue. Their complaint is that proponents keep revising what ought to be fixed goalposts... Well, fixed unless you believe unassisted human developers are also getting dramatically better at their jobs every year.

      Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.

      _____

      Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

      Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

      Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

      Apr 2025: [Ad nauseam, you get the idea]

  • Convenient for whom, and what for...? There is nothing tangible to gain from you believing or not believing that someone else does (or does not) get a productivity boost from AI. This is not a religion and it's not crypto. An AI user's net worth is not tied to anyone else's use of or stance on AI (if anything, it's the opposite).

    More generally, this phenomenon is quite simply explained and not surprising: new things improve, quickly. That does not mean a given thing is good or valuable, but it's how new tech gets introduced every single time, and it readily explains changing sentiment.

    • I think you're missing the broader context. There are a lot of people very invested in the maximalist outcome, which does create pressure for people to be boosters. You don't need a digital token for that to happen. There's a social media aspect as well that creates a feedback loop about claims.

      We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.

    • I saw that edit. Indeed you can't predict that rejecting a new thing is part of a routine of being wrong. It's true that "it's strange and new, therefore I hate it" is a very human (and adorable) instinct, but sometimes it's reasonable.
