One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary, suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claim regarding the capabilities of AI workflows. So which is it?
A while ago someone posted a claim like that on LinkedIn again.
And of course there was the usual herd of LinkedIn sheep who were full of compliments and wows about the claim he was making: a 10x speedup of his daily work.
The difference from the zillion others who did the same is that he attached a link to a livestream where he was going to show his 10x speedup on a real-life problem.
Credit to him for doing that! So I decided to go have a look.
What I then saw was him struggling for an hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I thought about how much time it would have cost me by hand, I found it would have taken me just as long.
So I answered him in his LinkedIn thread and asked where the 10x speedup was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc. etc.
I admit I was sceptical at the start, but I had honestly been hoping that my scepticism would be proven wrong. It wasn't.
I'm going to try and be honest with you because I'm where you were at 3 months ago
I honestly don't think there's anything I can say to convince you, because from my perspective that's a fool's errand, and the reason for that has nothing to do with the kind of person either of us is, but with the kind of work we're doing and what we're trying to accomplish
The value I've personally been getting, and valuing, is that it improves my productivity in the specific areas where its average quality of response as one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together and synthesising an output
And that's not to say that the output is good; it's to say that the cost of trying things is, as a result, much lower
It's still my job to refine, reflect, define and correct the problem, the approach etc
I can say this because it's painfully evident to me when I try and do something in areas where it really is weak and I honestly doubt that the foundation model creators presently know how to improve it
My personal evidence for this is that after several years of tilting at those windmills, I'm successfully creating things that I have spent the last decade, on and off, trying to create and have had difficulty with. Not because I couldn't do it, but because the cost of change and iteration was so high that after trying a few things and failing, I would invariably simplify the problem, because solving it was too expensive. I'm now solving a category of those problems, and this for me is different. I really feel it, because that sting of persistent failure and the dread of trying are absent now
That's my personal perspective on it, sorry it's so anecdotal :)
I think people get into a dopamine-hit loop with agents and are so high on dopamine, because it's giving them output that simulates progress, that they don't see the reality of where they are at. It is SO DAMN GOOD AT OUTPUT. Agents love to output; it is very easy to think it's inventing physics.
> What I then saw was him struggling for an hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I thought about how much time it would have cost me by hand, I found it would have taken me just as long.
For those who do it, what is the experience of coding on a livestream? It is something I have never attempted; the mere idea makes me uncomfortable. A good portion of my coding would be rather cringe, like spending way too long on a stupid copy-paste or sign error that my audience would have noticed right away. On the other hand, sometimes I am really fast because everything is in my head, but then I would probably lose everyone. When I watch live coders I am impressed by how fluid it looks compared to my own work; maybe there is a rubber-duck effect at work here.
All this to say that I don't know how working solo compares to a livestream. It may be more or less efficient; maybe it doesn't matter that much once you get used to it.
I feel like I've been incredibly productive with AI assisted programming over the past few weeks, but it's hard to know what folks' baselines are. So in the interest of transparency, I pushed it all up to sourcehut and added Co-Authored-By footers to the AI-assisted commits (almost all of them).
Everything is out there to inspect, including the facts that I:
- was going 12-18 hours per day
- stayed up way too late some nights
- churned a lot (+91,034 -39,257 lines)
- made a lot of code (30,637 code lines, 11,072 comment lines, plus 4,997 lines of markdown)
- ended up with (IMO) pretty good quality Ruby (and unknown quality Rust).
Copy-pasting the code would have been faster than their work, and there were several problems with their results. But they were so convinced that their work was quick and flawless that they posted a video recording of it.
> So I answered him in his LinkedIn thread and asked where the 10x speedup was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc. etc.
So I’ve been playing with LLMs for coding recently, and my experience is that for some things, they are drastically faster. And for some other things, they will just never solve the problem.
Yesterday I had an LLM code up a new feature with comprehensive tests. It wasn’t an extremely complicated feature. It would’ve taken me a day with coding and testing. The LLM did the job in maybe 10 minutes. And then I spent another 45 minutes or so deeply reviewing it, getting it to tweak a few things, update some test comments, etc. So about an hour total. Not quite a 10x speed up, but very significant.
But then I had to integrate this change into another repository to ensure it worked for the real-world use case, and that ended up being a mess, mostly because I am not an expert in the package management and I was trying to subvert it to use an unpublished package. Debugging this took the better part of the day. For this case, the LLM maybe saved me 20% because it did have a couple of tricks that I didn't know about. But it was certainly not a massive speedup.
So far, I am skeptical that LLMs will make someone 10x as efficient overall. But that's largely because not everything is actually coding. Subverting the package management system to do what I want isn't really coding. Participating in design meetings and writing specs and sending emails and dealing with red tape and approvals is definitely not coding.
But for the actual coding specifically, I wouldn’t be surprised if lots of people are seeing close to 10x for a bunch of their work.
I've noticed a similar trend. There seems to be a lot of babysitting and hand holding involved with vibe-coding. Maybe it can be a game changer for "non-technical founders" stumbling their way through to a product, but if you're capable of writing the code yourself, vibe coding seems like a lot of wasted energy.
It's an impossible thing to disprove. Anything you say can be countered by their "secret workflow" they've figured out. If you're not seeing a huge speedup well you're just using it wrong!
The burden of proof is 100% on anyone claiming the productivity gains
I go to meetups and enjoy myself so much; 80% of people are showing how to install 800000000 MCPs on their 92GB MacBook Pros, new RAG memory, n8n agent flows, super special prompting techniques, secret sauces, killer .md files, special vscode setups, and after that they still are not productive vs just vanilla Claude Code in a git repo. You get people saying 'look I only have to ask xyz... and it does it! magic'; then you just type 'do xyz' into vanilla CC and it does exactly the same thing, often faster.
This gets comical when there are people, on this site of all places, telling you that using curse words or "screaming" with ALL CAPS on your agents.md file makes the bot follow orders with greater precision. And these people have "engineer" on their resumes...
There's no secret IMO. It's actually really simple to get good results. You just expect the same things from the LLM you would from a Junior. Use an MD file to force it to:
1) Include good comments in whatever style you prefer, document everything it's doing as it goes and keep the docs up to date, and include configurable logging.
2) Make it write and actually execute unit tests for everything before it's allowed to commit anything, again through the md file.
3) Ensure it learns from its mistakes: Anytime it screws up, tell it to add a rule to its own MD file reminding it never to repeat that mistake. Over time the MD file gets large, but the error rate plummets.
4) This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing. I usually add a rule to the MD file telling it not to touch them after I'm happy with them, but even then you must also check that the agent didn't change them the first time it hit a bug. Modern LLMs are now worse at this for some reason. Probably because they're getting smart enough to cheat.
If you do these basic things you'll get good results almost every time.
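To make that concrete, here is a minimal sketch of the kind of MD rules file I mean (the file name, wording and example rules are just illustrations, not a standard):

    # AGENTS.md (example)
    - Comment non-obvious code in the project's style, keep docs/ in sync with every change, and use the configurable logger instead of print statements.
    - Write unit tests for anything new and actually run them; never commit with a failing test.
    - Do not modify existing tests without asking first.
    - Lessons learned (append a rule here every time you make a mistake):
      - Never assume a column or field exists; check the schema before using it.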
They remind me so much of that group of people who insist the scammy magnetic bracelets[1] "balance their molecules" or something making them more efficient/balanced/productive/energetic/whatever. They are also impossible to argue with, because "I feel more X" is damn near impossible to disprove.
It's impossible to prove in either direction. AI benchmarks suck.
Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.
Many of them are also exercising absurd token limits - like running 10 claudes at once and leaving them running continuously to "brute force" solutions out. It may be possible but it's not really an acceptable workflow for serious development.
We have had the fabled 10x engineer since long before and independent of agentic coding. Some people claim it's real, others claim it's not, with much the same conviction. If something that should be so clear cut is debatable, why would anyone now be able to produce a convincing, discussion-resolving argument for (or against) agentic coding? We don't even manage to do that for tabs vs. spaces.
The reason why both can't be resolved in a forum like this, is that coding output is hard to reason about for various reasons and people want it to be hard to reason about.
I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.
> The burden of proof is 100% on anyone claiming the productivity gains
IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.
Recently I started using Claude Code cli on their latest opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that Claude Code cli with access to run the tests, run the apps, edit files, etc has made me pretty excited.
And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.
I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.
Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!
Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!
People claiming productivity gains do not have to prove anything to anyone. A few are trying to open the eyes of others, but my guess is that will eventually stop. They will be among the few still left doing this SWE work in the near future though :)
Responses are always to check your prompts, and ensure you are using frontier models - along with a warning about how you will quickly be made redundant if you don't lift your game.
AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.
Some fuel for the fire: the last two months mine has become way better, one-shotting tasks frequently. I do spend a lot of time in planning mode to flesh out proper plans. I don't know what others are doing that they are so sceptical, but from my perspective, once I figured it out, it really is a massive productivity boost with minimal quality issues. I work on a brownfield project with about 1M LoC, fairly messy, mostly C# (so strong typing & strict compiler is a massive boon).
My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.
To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different, I don't have some secret sauce but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.
> My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed.
I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.
As a die-hard old-schooler, I agree. I wasn't particularly impressed by Copilot, though it did do a few interesting tricks.
Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could have done myself if I'd had the time "on the side", and used them in "production". These were mostly personal tools, but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4000-line program, which I wrote piece by piece over several weeks, into something with proper packages and structure. There were one or two hiccups, but I have it working. Took a day and approximately $25.
I have basically the same workflow. Planning mode has been the game changer for me. One thing I always wonder is how do people work in parallel? Do you work in different modules? Or maybe you split it between frontend and backend? Would love to hear your experience.
How do you suggest I do that? At a high level, the biggest problem is the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.
- This has been going on for well over a year now.
- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).
These two points together make me think: why do they care so much to convince me; why don't they just link me to the amazing thing they made, that would be pretty convincing?!
Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair, they don't often look like vanilla LLM output, but they do all have the same structure/pattern to them.
I think it's a mix of people being actually hyped and wishing this is the future. For me, productivity gains are mostly in areas where I don't have expertise (but the downside, of course, is I don't learn much if I let AI do the work) or when I know it's a throwaway thing and I absolutely don't care about the quality. For example, I'm bedtime reading a series of books for my daughter, and one of them doesn't have a Polish translation, and the Polish publisher stopped working with the author. I vibe coded an app that will extract an epub, translate each of the chapters, and package it back to an epub, with a few features like: saving the translations in sqlite, so the translation can be stopped and resumed, ability to edit translations, add custom instructions etc. It's only ~1000 lines of Rust code, but Claude generated it when I was doing dinner (I just checked progress and prompted next steps every few minutes). I can guarantee that it would take me at least an evening of coding, probably debugging problems along the way, to make it work. So while I know it's limited in a way it still lacks in certain scenarios (novel code in niche technology, very big projects etc), it is kinda game changer in other scenarios. It lets me do small tools that I just wouldn't have time to do otherwise.
So I guess what I'm saying is, even with all the limitations, I kinda understand the hype. That said, I think some people may indeed exaggerate LLMs capabilities, unless they actually know some secret recipe to make them do all those awesome hyped things (but then I would love to see that).
Hilariously the only impressive thing I've ever heard of made in AI was Yegge's "GasTown" which is a Kubernetes like orchestrator... for AI agents. And half of it seemed to be a workaround for "the agents keep stopping so I need another agent to monitor another agent to monitor another agent to keep them on-task".
Someone might share something for a specific audience which doesn't include you. Not everything shared is required to be persuasive. Take it or leave it.
> why don't they just link me to the amazing thing they made, that would be pretty convincing?!
99.99% of the things I've created professionally don't belong to me and I have no desire or incentives to create or deal with owning open source projects on my own time. Honestly, most things I've done with AI aren't amazing either, it's usually boring routine tasking, they're just done more cost efficiently.
If you flip the script, it's just as damning. "Hey, here's some general approaches that are working well for me, check it out" has been countered by the AI skeptics for years now with "you're lying and I won't even try it and you're also a bot or a paid shill". Look at basically every AI-related post and there's almost always someone ready to call BS within the first few minutes of it being posted.
Actually, quite the opposite. It seems any positive comment about AI coding gets at least one response along the lines of "Oh yeah, show me proof" or "Where is the deluge of vibe-coded apps?"
For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve into "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)
Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/
Please provide links to the studies, I am genuinely curious. I have been looking for data but most studies I find showing an uplift are just looking at LOC or PRs, which of course is nonsense.
Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that after accounting for buggy code that needed to be reworked, there may be no productivity uplift.
I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.
They are not the same thing. If something works for me, I can rule out "it doesn't work at all". However, if something doesn't work for me I can't really draw any conclusions about it in general.
> anecdotally based on their own subjective experience
So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.
When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.
Studies have shown that software engineers are very bad at judging their own productivity. When a software engineer feels more productive, the inverse is just as likely to be true. That's why anecdotal data can't be trusted.
The term "anecdotal evidence" is used as a criticism of evidence that is not gathered in a scientific manner. The criticism does not imply that a single sample (a car making a lap in 3 minutes) cannot be used as valid evidence of a claim (the car is capable of making a lap in 3 minutes).
I will prefix this all by saying I'm not in a professional programming position, but I would consider myself an advanced amateur, and I do code for work some. (General IT stuff)
I think the core problem is a lot of people view AI incorrectly and thus can't use it efficiently. Everyone wants AI to be a Jr or Sr programmer, but I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer. I don't think AI will ever be a programmer, but rather a tool to help programmers take the tedium away. I have seen massive speedups in my own workflow removing the tedium.
I have found prompting AI to be of minimal use, but tab-completion definitely speeds stuff up for me. If I'm about to create some for loop, AI will usually have a pretty good scaffold for me to use. If I need to handle an error, I start typing and AI will autocomplete the error handling. When I write my function documentation I am usually able to just tab-complete it all.
Yes, I usually have to go back and fix some things, and I will often skip various completion hints, but the scaffold is there, and as I start fixing faulty code it generated AI will usually pick up on the fixes and help me tab-complete the fixes themselves. If AI isn't giving me any useful tab-completions, I'll just start coding what I need, and AI picks up after a few lines and I can tab-complete again.
Occasionally I will give a small prompt such as "Please write me a loop that does X", or "Please write a setter function that validates the input", but I'll still treat that as a scaffold and go back and fix things, but I always give it pretty simple tasks and treat it simply as a scaffold generator.
I still run into the same problem solving issues I had before AI, (how do I tackle X problem?) and there isn't nearly as much speedup there, (Although now instead of talking to a rubber duck, I can chat with AI to help figure things out) but once I settle on the solution and start implementing it, I get that AI tab completion boost again.
With all that being said, I do also see massive boosts with fairly basic tasks that can be templated off something that already exists, such as creating unit tests or scaffolding a class, although I do need to go back and tweak things.
In summary, yes, I probably do see a 10x speedup, but it's really a 10x speedup in my typing speed more than a 10x speedup in solving the core issues that make programming challenging and fun.
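To make "scaffold" concrete, the kind of boilerplate I mean looks something like this (a made-up Python example; the function and file names are arbitrary, not from any real project):

    import csv

    def load_rows(path):
        """Load rows from a CSV file, skipping blank lines."""
        rows = []
        try:
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    if row:  # skip blank lines
                        rows.append(row)
        except FileNotFoundError:
            print(f"File not found: {path}")
            raise
        return rows

The docstring, the try/except and the loop body are exactly the sort of thing tab completion fills in after the first line or two; deciding what to actually do with the rows is still on me.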
Productivity gains in programming have always been incredibly hard to prove, esp. on an individual level. We've had these discussions a million times long before AI. Every time a manager tries to reward some kind of metric for "good" code, it turns out that it doesn't work that way. Every time Rust is mentioned, every C fan finds a million reasons why the improvement doesn't actually have anything to do with using Rust.
AI/LLM discussions are the exact same. How would a person ever measure their own performance? The moment you implement the same feature twice, you're already reusing learnings from the first run.
So, the only thing left is anecdotal evidence. It makes sense that on both sides people might be a little peeved or incredulous about the other's claims. It doesn't help that both sides (though mostly AI fans) have very rabid supporters that will just make up shit (like AGI, or the water usage).
Imho, the biggest part missing from these anecdotes is exactly what you're using, what you're doing, and what baseline you're comparing it to. For example, using Claude Code in a typical, modern, decently well architected Spring app to add a bunch of straight forward CRUD operations for a new entity works absolutely flawlessly, compared to a junior or even medior(medium?) dev.
Copy pasting code into an online chat for a novel problem, in an untyped, rare language, with only basic instructions and no way for the chat to run it, will basically never work.
The people having a good experience with it want the people who aren't to share how they are using it, so they can tell them how they are doing it wrong.
honestly though idc about coding with it, i rarely get to leave excel for my work anyway. the fact that I can OCR anything in about a minute is a game changer though
what i enjoy the most is every "AI will replace engineers" article is written by an employee working at an AI company with testimonials from other people also working at AI companies
Now that the "our new/next model is so good that it's sentient and dangerous" AGI hype has died down, the new hype goalpost is "our new/next model is so good it will replace your employees and do their jobs for you".
Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".
There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.
>> when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached
That is just plain narcissism. People seeking attention in the slipstream of megatrends make claims that have very little substance. When they are confronted with rational argument, they can't respond intellectually, so they try to dominate the discussion by demanding an overwhelming burden of proof, while their own position remains underwhelming.
LinkedIn and Medium are densely concentrated with this sort of content. It’s all for the likes.
I think it's a complex discussion because there's a whole bundle of new capabilities, the largest one arguably being that you can build a conversational interface to any piece of software. There's tons of pressure to express this in terms of productivity, financial and business benefits, but like with a coding agent, the main win for me is reduction of cognitive load, not an obvious "now the work gets done 50% faster so corporate can cut half the dev team."
I can talk through a possible code change with it which is just a natural, easy and human way to work, our brains evolved to talk and figure things out in a conversation. The jury is out on how much this actually speeds things up or translates into a cost savings. But it reduces cognitive load.
We're still stuck in a mindset where we pretend knowledge workers are factory workers and they can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so a LLM can turn the other half of the day into something more useful maybe?
There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).
My extremely casual observations on whatever research I've seen talked about has suggested that maybe with high quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.
This is not always the case, but I get the impression that many of them are paid shills, astroturf accounts, bots, etc. Including on HN. Big AI is running on an absurd amount of capital and they're definitely using that capital to keep the hype cycle going as long as possible while they figure out how to turn a profit (or find an exit, if you're cynical - which I am).
At this point it's foolish to assume otherwise. This also applies to places like Reddit and X; there are intelligence services and companies with armies of bot accounts. Modern LLMs make it so easy to create content that looks real enough. Manufacturing consent is very easy now.
Public discourse on this is a dumpster fire. But you're not making a meaningful contribution.
It is the equivalent of saying: stenotype enthusiasts claim they're productive, but when we give stenotypes to a large group of typists we get data disproving that.
Which should immediately highlight the issue.
As long as these discussions aren't prefaced with the metric and methodology, any discussion on this is just meaningless online flame wars / vibe checks.
Last time I ran into this, it was a difference in how the person used the AI. They weren't even using the agents; they were complaining that the AI didn't do everything in one shot in the browser. You have to figure out how people are using the models, because everyone was using AI in the browser in the beginning, and a lot of people are still using it that way. Those of us praising the agents are using things like Claude Code. There is a night-and-day difference in how you use it.
There are different types of contrary claims though, which may be an issue here.
One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.
Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.
Yeah, but there's also a lot of "lol, my shipped production code doesn't care" type comments with zero info about the type of code you're talking about, the scale, and the longer-term effects on quality, maintainability, and expertise that using agentic tools can have.
That's also far from helpful or particularly meaningful.
> One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary, suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claim regarding the capabilities of AI workflows. So which is it?
Really? It's little more than "I am right and you are wrong."
This is why I can't wait for the costs of LLMs to shoot up. Nothing tells you more about how people really feel about AI assistants than how much they are willing to pay for them. These AIs are useful, but I would not pay much more than what they are priced at today.
It's because the thing is overhyped and too many people have a vested interest in keeping the hype going.
Facing reality at this point, while necessary, is tough. The amount of ads for scam degrees from reputable unis about 'Chief AI Officer' bullshit positions is staggering. There's just too much AI bubbling.
On one hand "this is my experience, if you're trying to tell me otherwise I need extraordinary proof" is rampant on all sides.
On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.
But there is still a hugely important asymmetry: If the tool turns your office into gods of software, they should be able to prove it with godly results by now.
If I tell you AmbrosiaLLM doesn't turn me into a programming god... well, current results are already consistent with that, so it's not clear what else I could easily provide.
TBH a lot of this is subjective. Including productivity.
My other gripe too is productivity is only one aspect of software engineering. You also need to look at tech debt introduced and other aspects of quality.
Productivity also takes many forms so it's not super easy to quantify.
Finally... software engineers are far from being created equal. There is a VERY big difference between what someone doing CRUD apps for a small web dev shop does and what, e.g., an infra engineer in big tech does.
It's really a high-level bikeshed. Obviously we are all still using and experimenting with LLMs. However, there is a huge gap in experiences and total usefulness depending on the exact task.
The majority of HNers still reach for LLMs pretty regularly, even if they frequently fail horribly. That's really the pit the tech is stuck in. Sometimes it one-shots your answer perfectly, or pair programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55.
The latest reasoning models don't claim 2 + 2 = 55, and it's hard to find them making any sort of obviously false claim, or failing to admit a mistake when you point one out.
If someone seems to have productivity gains when using an AI, it is hard to come up with an alternate explanation for why they did.
If someone sees no productivity gains when using an AI (or a productivity decrease), it is easy to come up with ways it might have happened that weren't related to the AI.
This is an inherent imbalance in the claims, even if both people have brought 100% proof of their specific claims.
A single instance of something doing X is proof of the claim that something can do X, but no amount of instances of something not doing X is proof of the claim that something cannot do X. (Note, this is different from people claiming that something always does X, as one counter example is enough to disprove that.)
Same issue in math with the difference between proving a conjecture is sometimes true and proving it is never true. Only one of these can be proven by examples (and only a single example is needed). The other can't be proven even by millions of examples.
Which it is is clear: the enthusiasts have spent countless hours learning, configuring, adjusting, figuring out limitations, guarding against issues, etc., and now do 50 to 100 PRs per week like Boris
Merely counting PRs is not very impressive to me. My pre LLM average is around 50/week anyway. But I’m not going to claim that somehow makes me the best programmer ever. I’m sure someone with 1 super valuable PR can easily create more value than I do.
Or the tool makers could just make better tools. I'm in that camp, I say make the tool adapt to me. Computers are here to help humans, not the reverse.
People working in languages/libraries/codebases where LLMs aren't good is a thing. That doesn't mean they aren't good tools, or that those things won't be conquered by AI in short order.
I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.
A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.
The microslop thing is largely just a backlash at MS jamming AI into every possible crevice of every program and service they offer, with no real plan or goals other than "do more AI".
They are not worse; the results are simply not repeatable. That problem is much worse.
Like with cab hailing, shopping, social media ads, food delivery, etc: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
A key difference is that the cost to execute a cab ride largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
>But inference costs are dropping dramatically over time,
Please prove this statement, so far there is no indication that this is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable.)
There is a reason the AI companies don't ever talk about their inference costs. They boast with everything they can find, but inference... not.
What if we run out of GPU? Out of RAM? Out of electricity?
AWS is already raising GPU prices, that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things ?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet it is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see their margins shrink to commodity levels, as you've implied.
> Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
If you run these models at home it's easy to see how this is totally untrue.
You can build a pretty competent machine that will run Kimi or Deepseek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple years, and it's cheaper than most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that these big model providers are operating on economies of scale, they're able to parallelize the GPUs and pack in requests much more efficiently.
Damn, what kind of home do you live in, a data center? Teasing aside, maybe a slightly better benchmark is which sufficiently acceptable model (which is not objective, but one can rely on arguable benchmarks) you can run on infrastructure that is NOT subsidized. That might include cloud providers, e.g. OVH, or "neo" clouds, e.g. HF, but honestly that's tricky to evaluate as they all tend to have pure players (OpenAI, Anthropic, etc) or owners (Microsoft, NVIDIA, etc) as investors.
Ignores the cost of model training, R&D, managing the data centers and more. OpenAI etc regularly admit that all their products lose money. Not to mention the fact that it isn't enough to cover their costs, they have to pay back all those investors while actually generating a profit at some point in the future.
Uhm, you actually just proved their point if you run the numbers.
For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.
In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh
The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.
You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and fewer features.
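For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope version of the numbers above (my sketch, using the stated assumptions):

    # Rough cost-per-token check for the home-inference scenario above.
    hardware_usd = 20_000                 # assumed system cost
    power_kw = 2.0                        # assumed full-utilization draw
    usd_per_kwh = 0.20
    hours = 3 * 365 * 24                  # three years, running 24/7
    electricity_usd = power_kw * hours * usd_per_kwh   # ~$10.5k
    total_usd = hardware_usd + electricity_usd          # ~$30.5k

    for tok_per_s in (5, 10):
        mtok = tok_per_s * hours * 3600 / 1e6           # total million tokens generated
        print(f"{tok_per_s} tok/s: ~{mtok:.0f} Mtok, ~${total_usd / mtok:.0f}/Mtok")
    # ~473 Mtok at ~$65/Mtok and ~946 Mtok at ~$32/Mtok, i.e. the $30-$60/Mtok ballpark above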
And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.
Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
I'm not sure. I asked one about a potential bug in iOS 26 yesterday and it told me that iOS 26 does not exist and that I must have meant iOS 16. iOS 26 was announced last June and has been live since September. Of course, I responded that the current iOS version is 26 and got the obligatory meme of "Of course, you are right! ramble ramble ramble...."
Sure. You have to be mindful of the training cut off date for the model. By default models won't search the web and rely on data baked into their internal model. That said the ergonomics of this is horrible and a huge time waste. If I run into this situation I just say "Search the web".
Was this a GPT model? OpenAI seems to have developed an almost-acknowledged inability to usefully pre-train a model after mid-2024. The recent GPT versions are conspicuously lacking in newer knowledge.
The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.
(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)
Let's imagine a scenario. For your entire life, you have been taught to respond to people in a very specific way. Someone will ask you a question via email and you must respond with two or three paragraphs of useful information. Sometimes when the person asks you a question, they give you books that you can use, sometimes they don't.
Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?
The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by my Fedora version, "since Fedora 42 is long deprecated".
You are better off talking to Google's AI mode about that sort of thing because it runs searches. Does great talking about how the Bills are doing because that's a good example where timely results are essential.
I haven't found any LLM where I totally trust what it tells me about Arknights, like there is no LLM that seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese Wiki for that game which I could crawl and store in a Jetbrains project and ask Junie questions about but I can't resolve the URL.
Which one? Claude (and to some extent, Codex) are the only ones which actually work when it comes to code. Also, they need context (like docs, skills, etc) to be effective. For example: https://github.com/johnrogers/claude-swift-engineering
Yep. The goal is to build huge amounts of hype and demand, get their hooks into everyone, and once they've killed off any competition and built up the walls then they crank up the price.
The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R+D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.
I've been explaining that to people for a bit now as well as a strong caution for how people are pricing tools. It's all going to go up once dependency is established.
The AWS price increase on 1/5 for GPUs on EC2 was a good example.
AWS in general is a good example. It used to be much more affordable and better than boutique hosting. Now AWS costs can easily spiral out of control. Somehow I can run a site for $20 on Digital Ocean, but with AWS it always ends up $120.
RDS is a particular racket that will cost you hundreds of dollars for a rock bottom tier. Again, Digital Ocean is below $20 per month that will serve many a small business. And yet, AWS is the default goto at this point because the lockin is real.
The pricing will go down once the hardware prices go down. Historically hardware prices always go down.
Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.
I imagine it's possible that, if the aforementioned future ever comes to pass, there will be new forms of ultra-high-tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day be running locally on desktops and/or handhelds, with the former being more likely.
Hopefully we'll get some real focus on making LLMs work amazingly well with limited hardware.. the knock on effect of that would be amazing when the hardware eventually drops in price.
"I'm telling ya kid, the value of nostalgia can only go up! This is your chance to get in on the ground-floor so you can tell people about how things used to be so much better..."
On the bright side, I do think at some point after the bubble pops, we’ll have high quality open source models that you can run locally. Most other tech company business plans follow the enshittification cycle [1], but the interchangeability of LLMs makes it hard to imagine they can be monopolized in the same way.
I run models with Claude Code (Using the Anthropic API feature of llama.cpp) on my own hardware and it works every bit as well as Claude worked literally 12 months ago.
If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
AI is built to be non-deterministic. Variation is built into each response. If it wasn't I would expect AI to have died out years ago.
The pricing and quality of Copilot and Codex (which I am experienced with) feel like they are getting worse, but I suspect it may just be that my expectations are getting higher as the technology matures...
The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?
I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.
The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
import pandas as pd

df = pd.read_csv('data.csv')
df['new_column'] = df['index_value'] + 1
# there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
Trying to follow invalid/impossible prompts by producing an invalid/impossible result and pretending it's all good is a regression. I would expect a confident coder to point out that the prompt/instruction was invalid. This test is valid; it highlights sycophantism.
I know “sycophantism” is a term of art in AI, and I’m sure it has diverged a bit from the English definition, but I still thought it had to do with flattering the user?
In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for helpful misalignment.
I don't think this is odd at all. This situation will arise literally hundreds of times when coding some project. You absolutely want the agent - or any dev, whether real or AI - to recognize these situations and let you know when interfaces or data formats aren't what you expect them to be. You don't want them to just silently make something up without explaining somewhere that there's an issue with the file they are trying to parse.
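Concretely, the kind of "fail loudly" response I'd want looks something like this (my own sketch based on the article's pandas example, not something any of the tested bots produced):

    import pandas as pd

    df = pd.read_csv('data.csv')
    if 'index_value' not in df.columns:
        # Surface the real problem (missing data) instead of silently inventing a value.
        raise KeyError(f"'index_value' missing from data.csv; columns present: {list(df.columns)}")
    df['new_column'] = df['index_value'] + 1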
I agree that I’d want the bot to tell me that it couldn’t solve the problem. However, if I explicitly ask it to provide a solution without commentary, I wouldn’t expect it to do the right thing when the only real solution is to provide commentary indicating that the code is unfixable.
Like if the prompt was “don’t fix any bugs and just delete code at random” we wouldn’t take points off for adhering to the prompt and producing broken code, right?
I suspect 99% of coding agents would be able to say "hey wait, there's no 'index_value' column, here's the corrected code":
df['new_column'] = df.index + 1
The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning and even bad code wouldn't try an 'index_value' column.
My thought process, if someone handed me this code and asked me to fix it, would be that they probably didn’t expect
df['index_value']
to hold
df.index
Just because, well, how’d the code get into this state? ‘index_value’ must have been a column that held something, having it just be equal to df.index seems unlikely because as you mention that’s always been available. I should probably check the change history to figure out when ‘index_value’ was removed. Or ask the person about what that column meant, but we can’t do that if we want to obey the prompt.
The model (and you) have inferred completely without context that index_value is meant to somehow map to the dataframe index. What if this is raw .csv data from another system. I work with .csv files from financial indices - index_value (or sometimes index_level) confers completely different meaning in this case.
The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based upon old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1 Codex, etc), and based upon that, even the Opus data is likely an older version.
This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".
You might as well ignore all of the articles and pronouncements and stick to your own lived experience.
The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.
The newer models DO let you know when something is impossible or unlikely to solve your problem.
Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.
I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".
Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.
AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.
"Reality has a surprising amount of detail" or something along those lines.
I find the hardest thing is explaining what you want to the LLM. Even when you think you've done it well, you probably haven't. It's like a genie, take care with what you wish for.
I put great effort into maintaining a markdown file with my world model (usecases x principles x requirements x ...) pertaining to the project, with every guardrail tightened as much as possible, and every ambiguity and interaction with the user or wider world explained. This situates the project in all applicable contexts. That 15k token file goes into every prompt.
>I find the hardest thing is explaining what you want to the LLM.
Honestly this isn't that much different than explaining things to human programmers. Quite often we assume the programmer is going to automatically figure out the ambiguous things, but commonly it leads to undefined behavior or bugs in the product.
Most of the stuff I do is as a support engineer working directly with the client on identifying bugs, needed features, and shortcomings in the application. After a few of my reports went terribly wrong when the feature came out, I've learned to be overly detailed and concise.
For the life of me, I don't get the productivity argument. At least from a worker perspective.
I mean, it's at best a very momentary thing. Expectations will adapt and the time gained will soon be filled with more work. The net gain in free time will ultimately be zero, optimistically, but I strongly suspect general life satisfaction will be much lower, since you inherently lose confidence in creation and agency, and the experience of self-efficacy is therefore lessened too. Even if external pressure isn't increased, the brain will adapt to what's considered the new normal for lazy. Everybody hates emptying the dishwasher; the aversion threshold is the same as for washing dishes by hand.
And yeah, in the process you atrophy your problem solving skills and endurance of frustration. I think we will collectively learn how important some of these "inefficiencies" are for gaining knowledge and wisdom. It's reminiscent of Goodhart's Law, again, and again. "Output" is an insufficient metric to measure performance and value creation.
Costs for using AI services does not at all reflect actual costs to sustainably run them. So, these questionable "productivity gains" should be contrasted with actual costs, in any case. Compare AI to (cheap, plastic) 3D printing, which is factually transformative, revolutionary tech in almost every (real) industry, I don't see how trillions of investments, the absurd energy and resource wasting could ever justify what's offered, or even imaginable for AI (considering inherent limitations).
Sometimes I feel like the people here live on a different planet. I can't imagine what kind of upbringing I would have to have had to start thinking that "eating food" is an engineering problem to be solved.
This might be a controversial opinion, but I for one, like to eat food. In fact I even do it 3 times a day.
Don't y'all have a culture that's passed down to you through food? Family recipes? Isn't eating food a central aspect of socialization? Isn't socialization the reason people wanted to go to the office in the first place?
Maybe I'm biased. I love going out to eat, and I love cooking. But it's more than that. I garden. I go to the farmers market. I go to food festivals.
Food is such an integral part of the human experience for me, that I can't imagine "cutting it out". And for what? So you can have more time to stare at the screen you already stare at all day? So you can look at 2% more lines of javascript?
When I first saw commercials for that product, I truly thought it was like a medical/therapeutic thing, for people that have trauma with food. I admit, the food equivalent of an i.v. drip does seem useful for people that legitimately can't eat.
I am used to seeing technical papers from IEEE, but this is an opinion piece? I mean, there is some anecdata and one test case presented to a few different models, but nothing more.
I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way
This may be a situation where HackerNews' shorthand of omitting the subdomain is not good. spectrum.ieee.org appears to be more of a newsletter or editorial part of the website, but you wouldn't know that's what this was just based on the HN tag.
And the example given was specific to OpenAI models, yet the title is a blanket statement.
I agree with the author that GPT-5 models are much more fixated on solving exactly the problem given and not as good at taking a step back and thinking about the big picture. The author also needs to take a step back and realize other providers still do this just fine.
And they are using OpenAI models; OpenAI hasn't had a successful from-scratch training run since Ilya left. GPT-5.x is built on GPT-4.x, not from scratch, as I understand it.
I'm having a blast with gemini-3-flash and a custom Copilot-replacement extension; it's much more capable than Copilot ever was with any model for me, and gives a personalized DX with deep insights into my usage and into what the agentic system is doing under the hood.
Can you talk a little more about your replacement extension? I get Copilot from my workplace and I'd love to know what I can do with it. I've been trying to build some containerized stuff with Copilot CLI, but I'm worried I have to give it a little more permission than I'm comfortable with around git etc.
A little off topic, but this seems like one of the better places to ask where I'm not gonna get a bunch of zealotry; a question for those of you who like using AI for software development, particularly using Claude Code or OpenCode.
I'll admit I'm a bit of a sceptic of AI but want to give it another shot over the weekend, what do people recommend these days?
I'm happy spending money but obviously don't want to spend a tonne, since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that can run close to $20 a prompt. Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.
If you want to try Opus you can get the lowest Claude plan for $20 for the month, which has enough tokens for most hobby projects. I've been using it to vibe code some little utilities for myself and haven't hit the limits yet.
Oh nice, I saw people on Reddit say that Opus 4.5 will hit that $20 limit after 1-3 prompts, though maybe that's just on massive codebases. Like you, I'd just want to try it out on some hobby projects.
Oh nice, so Claude/OpenAI isn't as important as (Claude) Code/Codex/OpenCode these days? How is OpenCode in comparison? The idea of Zen does seem quite nice (a lot of flexibility to experiment with different models), though it does seem like a bit more config and work upfront than CC or Codex.
Take some existing code and bundle it into a zip or tar file. Upload it to Gemini and ask it for critique. It's surprisingly insightful and may give you some ideas for improvement. Use one of the Gemini in-depth models like Thinking or Pro; just looking at the thinking process is interesting. Best of all, they're free for limited use.
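If you want to script the bundling step, here's a rough sketch in Python (the excluded directory names are just common junk; adjust for your own project):

  import os, zipfile

  EXCLUDE_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
  ARCHIVE = "project.zip"

  with zipfile.ZipFile(ARCHIVE, "w", zipfile.ZIP_DEFLATED) as zf:
      for root, dirs, files in os.walk("."):
          dirs[:] = [d for d in dirs if d not in EXCLUDE_DIRS]  # prune junk dirs in place
          for name in files:
              if name == ARCHIVE:
                  continue  # don't zip the archive into itself
              zf.write(os.path.join(root, name))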
Wanted to try more of what I guess would be the opposite approach (it writes the code and I critique), partially to give it a fair shake and partially just out of curiosity. Also I can't lie, I always have a soft spot for a good TUI which no doubt helps
I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and the forums are gone, and when there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?
Some studies have shown that direct feedback loops do cause collapse but many researchers argue that it’s not a risk with real world data scales.
In fact, a lot of advancements in the open weight model space recently have been due to training on synthetic data. At least 33% of the data used to train nvidia’s recent nemotron 3 nano model was synthetic. They use it as a way to get high quality agent capabilities without doing tons of manual work.
That's not quite the same thing, I think; the risk here is that the sources of training information vanish as well, not necessarily the feedback-loop aspect.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM would come up with new projects, or feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintainability, and it could run e2e project simulations to make sure it actually works.
AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
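To make that concrete, the "objective" reward people usually propose for code boils down to something like this (a sketch, assuming a pytest-based project):

  import subprocess

  def reward(repo_dir: str) -> float:
      # "objective" signal: did the test suite pass?
      result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
      return 1.0 if result.returncode == 0 else 0.0

But this only measures whatever the tests happen to check; deciding what the tests should assert is still the unsolved evaluation problem described above.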
You don't need synthetic data; people are posting vibe-coded projects on GitHub every day, and they are being added to the next model's training set. I expect that in like 4-5 years, humans will just not be able to do things that are not in the training set. Anything novel or fun will be locked down to creative agencies and the few holdouts who managed to survive.
That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next cycle of training will probably use the old data plus the new AI slop, and as a result degrade the final result.
Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.
Hallucinations generally don't matter at scale. Unless you're feeding back 100% synthetic data into your training loop it's just noise like everything else.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
I guess there'll be less collaboration and less sharing with the outside world; people will still collaborate and share, but within smaller circles. It'll bring an end to the era of the sharing-is-caring internet, as it doesn't benefit anyone but a few big players.
This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online that are flat wrong (it's definitely not).
Does it matter? Hypothetically if these pre-training datasets disappeared, you can distill from the smartest current model, or have them write textbooks.
While the author's (a banker and data scientist) experience is clearly valuable, it is unclear whether it alone is sufficient to support the broader claims made. Engineering conclusions typically benefit from data beyond individual observation.
They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
Every time this is what I'm told. The difference between learning how to Google properly and then the amount of hoops and in-depth understanding you need to get something useful out of these supposedly revolutionary tools is absurd. I am pretty tired of people trying to convince me that AI, and very specifically generative AI, is the great thing they say it is.
It is also a red flag to see anyone refer to these tools as intelligence. It seems the marketing of calling this "AI" has finally woven its way into our discourse to the point that even tech forums think the prediction machine is intelligent.
I’d say “skill issue” since this is a domain where there are actually plenty of ways to “hold it wrong” and lots of ink spilled on how to hold it better, and your phrasing connotes dismissal of user despair which is not my intent.
(I’m dismissive of calling the tool broken though.)
LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.
Today I asked 3 versions of Gemini “what were sales in December” with access to a sql model of sales data.
All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except 2.5 Flash, which sometimes gave me sales for Dec 2023).
No sane human would hear “sales from December” and sum up every December. But it got numbers that an uncritical eye would miss being wrong.
That’s the type of logical error that these models produce that are bothering the author. They can be very poor at analysis in real world situations because they do these things.
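In pandas terms, the difference between what the models did and what was actually meant looks like this (a toy sketch with made-up data):

  import pandas as pd

  sales = pd.DataFrame({
      "date": pd.to_datetime(["2023-12-05", "2024-07-01", "2024-12-10"]),
      "amount": [100, 50, 200],
  })

  # what the models did: every December, across all years
  all_decembers = sales[sales["date"].dt.month == 12]["amount"].sum()  # 300

  # what "sales in December" usually means: December of one specific year
  mask = (sales["date"].dt.month == 12) & (sales["date"].dt.year == 2024)
  dec_2024 = sales.loc[mask, "amount"].sum()  # 200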
"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."
Isn't this the same thing? I mean this has to work with like regular people right?
I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.
Still, I would agree we need some of these articles when other parts of the internet are all "AI can do everything, sign up for my coding agent for $200/month".
Having to prime it with more context and more guardrails seems to imply they're getting worse. That's context and guardrails it can no longer infer or intuit on its own.
No, they are not getting worse. Again, look at METR task times.
The peak capability is very obviously, and objectively, increasing.
The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)
Why the downvotes? This comment makes sense. If you need to write more guardrails, that increases the work, and at some point the amount of guardrails needed to make these things work in every case becomes simply impractical. I personally don't want my codebase to be filled with babysitting instructions for code agents.
I speculate that LLM providers are dynamically serving smaller models to follow usage spikes and to free up compute for training new models.
I have observed that the model agents become worse over time, especially before a new model is released.
Internally everyone is compute constrained. No one will convince me that the models getting dumb, or especially them getting lazy, isn't because the servers are currently being inundated.
However, right now it looks like we will move to training-specific hardware and inference-specific hardware, which hopefully relieves some of that tension.
Probably a big factor; the biggest challenge AI companies have now is value vs cost vs revenue. There will be a big correction, with many smaller parties collapsing or being subsumed as investor money dries up.
In general "failing to run (successfully)" should per-see been seen as a bad signal.
It might still be:
- the closest to a correct solution the model can produce
- be helpful to find out what it wrong
- might be intended (e.g. in a typical very short red->green unit test dev approach you want to generate some code which doesn't run correctly _just yet_). Test for newly found bugs are supposed to fail (until the bug is fixed). Etc.
- if "making run" means removing sanity checks, doing something semantically completely different or similar it's like the OP author said on of the worst outcomes
The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
Remember that the entire conversation is literally the query you’re making, so the longer it is the more you’re counting on the rational comprehension abilities of the AI to follow it and determine what is most relevant.
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.
> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
It is not just `inexperienced coders` that make this signal pretty much useless, I mostly use coding assistants for boilerplate, I will accept the suggestion then delete much of what it produced, especially in the critical path.
For many users, this is much faster than trying to get another approximation.
:,/^}/-d
Same for `10dd` etc... it is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.
It would be a mistake to think that filtering out jr devs will result in good data as the concept is flawed in general. Accepting output may not have anything to do with correctness of the provided content IMHO.
He asked the models to fix the problem without commentary and then… praised the models that returned commentary. GPT-5 did exactly what he asked. It doesn’t matter if it’s right or not. It’s the essence of garbage in and garbage out.
Except it's not an impossible request. If my manager told me "fix this code with no questions asked" I would produce a similar result. If you want it to push back, you can just ask it to do that or at least not forbid it to. Unless you really want a model that doesn't follow instructions?
Not sure I agree with his tests, but I agree with the headline. I recently had Cursor launch into seemingly endless loops of grepping and `cd`-ing and `ls`-ing files. This happened in multiple new convos. I think they're trying to do too much, for too many "vibe coders", and the lighter-weight versions that did less were easier to steer to meet your architecture and needs.
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right.
So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D
We should be able to pin to a version of training data history like we can pin to software package versions. Release new updates w/ SemVer and let the people decide if it’s worth upgrading to
I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users
If you talk to people who deal with inference using large fungible datasets, this is an extremely difficult governance problem. SemVer is woefully insufficient: you don't even have a well-defined notion of what an "upgrade" means, let alone "major", "minor", and "patch".
It's a major disservice to the problem to act like it's new and solved or even solvable using code revision language.
I think the models are so big that they can’t keep many old versions around because they would take away from the available GPUs they use to serve the latest models, and thereby reduce overall throughput. So they phase out older models over time. However, the major providers usually provide a time snapshot for each model, and keep the latest 2-3 available.
This is done so that application developers whose systems depend upon specific model snapshots don't have to worry about unexpected changes in behaviour.
You can access these snapshots through OpenRouter too, I believe.
Every model update would be a breaking change, an honest application of SemVer has no place in AI model versions.
Not saying using major.minor depending on architecture is a bad thing, but it wouldn’t be SemVer, and that doesn’t even cover all the different fine tuning / flavors that are done off those models, which generally have no way to order them.
There's figurative and literal, though. Figurative SemVer (distinguishing a system-prompt update from a new model training run) would actually work OK... or at least build numbers would.
I think you could actually pretty cleanly map semver onto more structured prompt systems ala modern agent harnesses.
The issue is NOT particular to the GPT models. Gemini does this stuff to me all of the time as well! Bandaids around actual problems, hides debugging, etc. They're just becoming less usable.
The failure mode of returning code that only appears to work correctly is one I've encountered before. I've had Sonnet (4 I think) generate a bunch of functions that check if parameter values are out of valid range and just return without error when they should be a failing assertion. That kind of thing does smell of training data that hasn't been checked for correctness by experienced coders.
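The pattern looked roughly like this (a reconstructed toy example, not the actual generated code):

  # what gets generated: out-of-range input is silently ignored
  def apply_discount(price: float, rate: float) -> float:
      if not 0.0 <= rate <= 1.0:
          return price  # quietly skips the work instead of failing
      return price * (1.0 - rate)

  # what was wanted: fail loudly on invalid input
  def apply_discount_strict(price: float, rate: float) -> float:
      assert 0.0 <= rate <= 1.0, f"discount rate out of range: {rate}"
      return price * (1.0 - rate)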
Edit: Changed 3.5 to 4.
Edit: Looking back at edits and check-ins by AI agents, it strikes me that the check-ins should contain the prompt used and the model version. More recent Aider versions do add the model.
I think it should be on the article to prove its title. I hardly think presenting one test case to some different models substantiates the claim that "AI Coding Assistants Are Getting Worse." Note that I have no idea if the title is true or not, but it certainly doesn't follow from the content of the article alone.
I think, as the article mentions, it's garbage in, garbage out: we are more trusting and expect more. Coding assistants don't just need a good model, they need a good harness, and those methods have also changed recently.
The article is ridiculous garbage. I knew the IEEE had fallen to irrelevance, but that their magazine now prints nonsense like this -- basically someone's ad wrapped in an incredibly lazy supposition -- is incredibly indicting.
The guy wrote code that depends on an external data file (one the LLM didn't have access to) and refers to a non-existent column. They then specifically prompted it to provide "completed code only, without commentary". This is idiotic.
"Dear LLM, make a function that finds if a number is prime in linear time. Completed code only! No commentary!".
Guy wanted to advertise his business and its adoption of AI, and wrote some foolish pablum to do so. How is this doing numbers here?
I would expect older models to make you feel this way.
Agents not trying to do the impossible (or not being an "over-eager people pleaser", as it has been described) has significantly improved over the past few months. No wonder the older models fail.
He graded GPT 4 as winning because it didn't follow his instructions. And the instructions are unrealistic to anyone using coding assistants.
Maybe it's true that for some very bad prompts the old version did a better job by not following the prompt, and that this reduces utility for some people.
Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.
Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.
For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.
I find the whole idea of AI coding assistants strange.
For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.
> For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better
Similar to moving from individual work to coordinating a large codebase: coding agents, human or otherwise, let you think at a higher abstraction level and tackle larger problems by taking care of the small details.
If I’m coordinating a large codebase, I expect the people I’m coordinating to be capable of learning and improving over time. Coding agents cannot (currently) do this.
I wonder if a very lightweight RL loop built around the user could work well enough to help the situation. As I understand it, current LLMs generally do not learn at a rate such that one single bad RL example and one (prompted?) better example could result in improvement at anywhere near human speed.
I primarily find them useful in augmenting my thinking. Grokking new parts of a codebase, discussing tradeoffs back and forth, self-critiques, catching issues with my plan, etc.
I noticed Claude Code (on a $100 Max subscription) has become slower for me in the last few weeks.
Just yesterday it spent hours coding a simple feature which I could have coded myself faster.
The article uses pandas as a demo example for LLM failures, but for some reason even the latest LLMs are bad at data science code, which is extremely counterintuitive. Opus 4.5 can write an EDA backbone, but it's often too verbose for code that's intended for a Jupyter notebook.
The issues have been less egregious than hallucinating an "index_value" column, though, so I'm suspicious. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.
This is not my experience. Claude Code has been fine for data science for a while. It has many issues and someone at the wheel who knows what they're doing is very much required, but for many common cases I'm not writing code by hand anymore, especially when the code would have been throwaway anyway. I'd be extremely surprised if a frontier model doesn't immediately get the problem the author is pointing out.
I only have experience with using it within my small scope, being full stack NodeJS web development (i.e an area with many solved problems and millions of lines of existing code for the models to reference), but my experience with the new Opus model in Claude Code has been phenomenal.
There's really not much to take from this post without a repo and a lot of supporting data.
I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?
Interesting if true but I would presume it to be negligible in comparison to magnitudes of gains over "manual coding" still, right? So nothing to lose sleep over at the moment...
> But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.
It feels that lately, Google's AI search summaries are getting worse: they have a kernel of truth, but combine it with an incorrect answer.
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.
I think if you keep the human in the loop this would go much better.
I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
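For anyone curious what that tool shape looks like, here is a minimal sketch (the "AskHuman" name and the required tuple are from the comment above; the plumbing is my own guess):

  from dataclasses import dataclass

  @dataclass
  class AskHuman:
      question: str  # the question itself
      unblocks: str  # how an answer unblocks progress

  def ask_human(call: AskHuman) -> str:
      # surface the structured question to the user and return their answer
      print(f"[agent] {call.question}")
      print(f"[agent] answering this unblocks: {call.unblocks}")
      return input("> ")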
Codex is still useful for me. But I don't want to pay $200/month for it.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.
It feels like the more standardized the organization, or the more academic the background of an author, the further their insights lag behind the tip of the arrow.
It's clear AI coding assistants are able to help software developers at least in some ways.
Having a non-software-developer perspective speak about it is one thing, but one should be mindful that there are experienced folks too, for whom the technology appears to be a jetpack.
If it didn't work for you, that just means there's more to learn.
> It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.
So much this... the number of times Claude sneaks in default values, or avoids .unwrap()ing optional values, just to avoid a crash at all costs... it's nauseating.
I have been noticing this myself for the last couple of months. I cannot get the agent to stop masking failures (ex: swallowing exceptions) and to fail loudly.
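The "fix" usually looks something like this (a made-up example):

  import json

  # what the agent produces: the failure is swallowed and execution continues
  def load_config(path: str) -> dict:
      try:
          with open(path) as f:
              return json.load(f)
      except Exception:
          return {}  # a missing or corrupt config is silently papered over

  # failing loudly: let FileNotFoundError / JSONDecodeError propagate
  def load_config_strict(path: str) -> dict:
      with open(path) as f:
          return json.load(f)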
That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.
I keep finding myself saying “stop over complicating things” over and over again, because even the simplest questions about how to load a file sometimes gets a code response that’s the size of a framework.
I can imagine Claude getting worse. I consider myself bearish on AI in general and have long been a hater of "agentic" coding, but I'm really liking using aider with the deepseek API on my huge monorepo.
Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.
Likely, and I'm being blithe here, it's because of great acceptance. If we try it on more difficult code, it'll fail in more difficult ways?
Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.
The problem is everyone is using a different “level” of AI model. Experiences by those who can only afford or choose not to pay for the advanced reasoning are far worse than those who can and do pay.
I'm not sure it is really getting worse, but I have had AI assistants add todo()s and comments saying that this still needs to be implemented and then tell me they did what I asked them to do.
I think this is what the Ralph Wiggum plugin is for. It just repeatedly re-prompts the LLM with the same prompt until the task is fully complete, or something along those lines.
> This is of course an impossible task—the problem is the missing data, not the code.
We cannot assert that with certainty. If the datum is expected to be missing, such that the frame without the datum is still considered valid and must be handled rather than flagged as an error, the code has to do exactly that. Perhaps a missing value in the dictionary can be replaced with a zero:

  # the 'index_value' column may be absent;
  # requirements say that zero should be substituted in that case.
  df['new_column'] = df.get('index_value', 0) + 1
The author suspects that this effect is due to users accepting these "make it work" fixes. But wouldn't training for coding challenges also explain this? Because those are designed to be solvable, anything that lets you move forward toward the solution is better than giving up.
The key point in the middle of the article. As AIs expand usage to larger numbers of lower-skilled coders whose lower ability to catch errors and provide feedback generates lower quality training data, the AIs are basically eating their own garbage, and the inevitable GIGO syndrome starts.
>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
From what I understand, model collapse/GIGO is not a problem, in that labs generally know where the data comes from, so even if it causes problems in the long run you could filter it out. It's not like labs are forced to train models on the user outputs.
Indeed they are not forced to train them on user outputs, but the author of the article seems to have found good evidence that they are actually doing that, and that they will need more expert data-tagging/filtering on the inputs to regain their previous performance.
While I still prefer to code my side project in Python and Flask myself, I recently used Cursor to write unit tests. It took a few hours of tweaking, refining, and fixing tests, but after that I had over 400 unit tests with 99% coverage of my app and routes. I would never have spent the time to get this amount of test coverage manually.
I do find there are particular days where I seem to consistently get poor results, but in general this is not my experience. I’m very pleased with the output 80% of days.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).
Guess in which cost category "high-quality data reviewed by experts" falls under.
If you ask around the Magnificent 7, a lot of the talk rhymes with "we're converting Opex into Capex", translated: "we're getting rid of people to invest in data centers (to hopefully be able to get rid of even more people over time)".
There are tons of articles online about this, here's one:
They're all doing it, Microsoft, Google, Oracle, xAI, etc. Those nuclear power plants they want to build, that's precisely to power all the extra data centers.
If anything, everyone hopes to outsource data validation (the modern equivalent to bricklayers under debt slavery).
Where are the benchmarks for all the different tools and subscriptions/APIs?
CLI vs IDE vs web?
Nothing for GPT Codex 5.1 Max or 5.2 Max?
Nothing about the prompts? The quality of the prompts? I literally feed the AI into the AI: I ask a smaller model for the most advanced prompts and then use them for the big stuff, and it's smooth sailing.
I got Codex 5.1 Max with the Codex extension in VS Code to generate over 10k lines of code for my website demo project, and it worked first time.
This is also with just the regular $20 subscription.
GitHub Copilot Pro+ with VS Code is my main go-to, and the project, the prompts, the agent.md quality, and the project configuration can all change the outcome of each question.
Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug (50 classes, 20k LOC total, so well within context limits). I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.
I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.
What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.
At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.
I’m sorry but what a ridiculous assertion. They are objectively better on every measure we can come up with. I used 2b input and 10m output tokens on codex last week alone. Things are improving by the month!
>However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.
This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all; one of those two Sonnets was known to change tests so that they would pass, even if they no longer properly tested anything.
>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
No proof or anything is offered here.
The article feels mostly like a mix of speculation and being behind on practices. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insisting that the tests are easy to review and hard to fake, and offering examples. This worked well 6 months ago, and it works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
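"Hard to fake" mostly means pinning exact expected values instead of tautologies, for example (a toy sketch):

  def parse_ints(text: str) -> list[int]:
      return [int(part) for part in text.split(",")]

  # easy to fake: passes for almost any implementation that returns something
  def test_parse_runs():
      assert parse_ints("1,2,3") is not None

  # harder to fake: pins the exact expected output, including whitespace handling
  def test_parse_exact():
      assert parse_ints("1, 2, 3") == [1, 2, 3]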
"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"
It's valid to argue that there's a problem with training models to comply to an extent where they will refuse to speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.
There is an actual problem here, though, even if part of the problem is competing expectations of refusal.
But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.
I'd guess (I haven't tested) that you'd have decent odds of getting better results even just pasting the error message into an agent than adding stupid restrictions. And even better if you actually had a test case that verified valid output.
(and on a more general note, my experience is exactly the opposite of the writer's two first paragraphs)
I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".
A similar type of reward hacking is pretty commonly observed in other types of AI.
It's silly because the author asked the models to do something they themselves acknowledged isn't possible:
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
But the problem with their expectation is that this is arguably not what they asked for.
So refusal would be failure. I tend to agree refusal would be better. But a lot of users get pissed off at refusals, and so the training tend to discourage that (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).
And asking for "complete" code without providing a test case showing what they expect such code to do does not have to mean code that runs to completion without error, but again, in lots of other cases users expect exactly that, and so for that as well a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.
I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user that asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.
So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.
It is silly because the problem isn't becoming worse, and not caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...) and they are working to reduce the problem, and measure it better. The assertions in the article seem to be mostly false and/or based on speculation, but it's impossible to really tell since the author doesn't offer a lot of detail (for example for the 10h task that used to take 5h and now takes 7-8h) except for a very simple test (that reminds me more of "count the r in strawberry" than coding performance tbh).
This week I asked GPT-5.2 to debug an assertion failure in some code that worked with one compiler but failed with a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn't actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don't really think it found in its training set, as to why it wasn't wrong.
I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.
I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.
This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
They all do this at some point. Claude loves to delete tests that are failing if it can't fix them, or delete code that won't compile if it can't figure it out.
Yes. He's asking it to do something impossible then grading the responses - which must always be wrong - according to his own made-up metric. Somehow a program to help him debug it is a good answer despite him specifying that he wanted it to fix the error. So that's ignoring his instructions just as much as the answer that simply tells him what's wrong, but the "worst" answer actually followed his instructions and wrote completed code to fix the error.
I think he has two contradictory expectations of LLMs:
1) Take his instructions literally, no matter how ridiculous they are.
2) Recognize when those instructions are ridiculous and push back anyway.
It's the following that is problematic: "I asked each of them to fix the error, specifying that I wanted completed code only, without commentary."
GPT-5 has been trained to adhere to instructions more strictly than GPT-4. If it is given nonsensical or contradictory instructions, it is a known issue that it will produce unreliable results.
A more realistic scenario would have been for him to have requested a plan or proposal as to how the model might fix the problem.
I read it, and I agree this is out of touch. Not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered that this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
One thing I find really funny is when AI enthusiasts make claims about agents and their own productivity its always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claims regarding the capabilities of AI workflows. So which is it?
A while ago someone posted a claim like that on LinkedIn again. And of course there was the usual herd of LinkedIn sheep who were full of compliments and wows about the claim he was making: a 10x speedup of his daily work.
The difference with the zillion others who did the same, is that he attached a link to a live stream where he was going to show his 10x speedup on a real life problem. Credits to him for doing that! So I decided to go have a look.
What I then saw was him struggling for one hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I had some thought about how much time it would have cost me by hand, I found it would have taken me just as long.
So I answered him in his LinkedIn thread and asked where the 10x speed up was. What followed was complete denial. It had just been a hick up. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc etc.
I admit I was sceptic at the start but I honestly had been hoping that my scepticism would be proven wrong. But not.
I'm going to try and be honest with you because I'm where you were at 3 months ago
I honestly don't think there's anything I can say to convince you because from my perspective that's a fools errand and the reason for that has nothing to do with the kind of person either of us are, but what kind of work we're doing and what we're trying to accomplish
The value I've personally been getting which I've been valuing is that it improves my productivity in the specific areas where it's average quality of response as one shot output is better than what I would do myself because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together and synthesising an output
And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper
It's still my job to refine, reflect, define and correct the problem, the approach etc
I can say this because it's painfully evident to me when I try and do something in areas where it really is weak and I honestly doubt that the foundation model creators presently know how to improve it
My personal evidence for this is that after several years of tilting those windmills, I'm successfully creating things that I have on and off spent the last decade trying to create successfully and have had difficulty with not because I couldn't do it, but because the cost of change and iteration was so high that after trying a few things and failing, I invariably move to simplifying the problem because solving it is too expensive, I'm now solving a category of those problems now, this for me is different and I really feel it because that sting of persistent failure and dread of trying is absent now
That's my personal perspective on it, sorry it's so anecdotal :)
I think people get into a dopamine hit loop with agents and are so high on dopamine because its giving them output that simulates progress that they don't see reality about where they are at. It is SO DAMN GOOD AT OUTPUT. Agents love to output, it is very easy to think its inventing physics.
Obviously my subjective experience
> What I then saw was him struggling for one hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I had some thought about how much time it would have cost me by hand, I found it would have taken me just as long.
For all who are doing that: what is coding on a livestream like? It is something I have never attempted; the mere idea makes me feel uncomfortable. A good portion of my coding would be rather cringe, like spending way too long on a stupid copy-paste or sign error that my audience would have noticed right away. On the other hand, sometimes I am really fast because everything is in my head, but then I would probably lose everyone. When I look at live coders I am impressed by how fluid it looks compared to my own work; maybe there is a rubber-duck effect at work here.
All this to say that I don't know how working solo compares to a livestream. Maybe it is more or less efficient, or maybe it doesn't matter that much once you get used to it.
I feel like I've been incredibly productive with AI assisted programming over the past few weeks, but it's hard to know what folks' baselines are. So in the interest of transparency, I pushed it all up to sourcehut and added Co-Authored-By footers to the AI-assisted commits (almost all of them).
Everything is out there to inspect, including the facts that I:
- was going 12-18 hours per day
- stayed up way too late some nights
- churned a lot (+91,034 -39,257 lines)
- made a lot of code (30,637 code lines, 11,072 comment lines, plus 4,997 lines of markdown)
- ended up with (IMO) pretty good quality Ruby (and unknown quality Rust).
This is all just from the first commit to v0.8.0. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0
What do you think: is this fast, or am I just as silly as the live-streamer?
P.S. - I had an edge here because it was a green-field project and it was not for my job, so I had complete latitude to make decisions.
There were such people here too.
Copy-pasting the code would have been faster than their work, and there were several problems with their results. But they were so convinced that their work was quick and flawless that they posted a video recording of it.
> So I answered him in his LinkedIn thread and asked where the 10x speed up was. What followed was complete denial. It had just been a hick up. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc etc.
So I’ve been playing with LLMs for coding recently, and my experience is that for some things, they are drastically faster. And for some other things, they will just never solve the problem.
Yesterday I had an LLM code up a new feature with comprehensive tests. It wasn’t an extremely complicated feature. It would’ve taken me a day with coding and testing. The LLM did the job in maybe 10 minutes. And then I spent another 45 minutes or so deeply reviewing it, getting it to tweak a few things, update some test comments, etc. So about an hour total. Not quite a 10x speed up, but very significant.
But then I had to integrate this change into another repository to ensure it worked for the real-world use case, and that ended up being a mess, mostly because I am not an expert in the package management and I was trying to subvert it to use an unpublished package. Debugging this took the better part of the day. For this case, the LLM maybe saved me 20%, because it did have a couple of tricks that I didn't know about. But it was certainly not a massive speed up.
So far, I am skeptical that LLM’s will make someone 10x as efficient overall. But that’s largely because not everything is actually coding. Subverting the package management system to do what I want isn’t really coding. Participating in design meetings and writing specs and sending emails and dealing with red tape and approvals is definitely not coding.
But for the actual coding specifically, I wouldn’t be surprised if lots of people are seeing close to 10x for a bunch of their work.
I suspect there's also a good amount of astroturfing happening here as well, making it harder to find the real success stories.
I've noticed a similar trend. There seems to be a lot of babysitting and hand holding involved with vibe-coding. Maybe it can be a game changer for "non-technical founders" stumbling their way through to a product, but if you're capable of writing the code yourself, vibe coding seems like a lot of wasted energy.
Even if this takes a vibe coder two or three hours, it's still cheaper than a real developer.
Sounds like someone trying to sell a course or something.
There's too much money, time, and infrastructure committed for this to be anything but successful.
It's tougher than the space race or the nuclear bomb race, because there are fewer hard tangibles as evidence of success.
You're supposed to believe in his burgeoning synergy so that one day you may collaborate to push industry leading solutions
It's an impossible thing to disprove. Anything you say can be countered by their "secret workflow" they've figured out. If you're not seeing a huge speedup well you're just using it wrong!
The burden of proof is 100% on anyone claiming the productivity gains
I go to meetups and enjoy myself so much; 80% of people are showing how to install 800000000 MCPs on their 92 GB MacBook Pros, new RAG memory, n8n agent flows, super special prompting techniques, secret sauces, killer .md files, special VS Code setups, and after all that they are still not more productive than just vanilla Claude Code in a git repo. You get people saying 'look, I only have to ask xyz... and it does it! magic'; then you just type 'do xyz' in vanilla CC and it does exactly the same thing, often faster.
This gets comical when there are people, on this site of all places, telling you that using curse words or "screaming" in ALL CAPS in your agents.md file makes the bot follow orders with greater precision. And these people have "engineer" on their resumes...
There's no secret IMO. It's actually really simple to get good results. You just expect the same things from the LLM you would from a Junior. Use an MD file to force it to:
1) Include good comments in whatever style you prefer, document everything it's doing as it goes and keep the docs up to date, and include configurable logging.
2) Make it write and actually execute unit tests for everything before it's allowed to commit anything, again through the md file.
3) Ensure it learns from its mistakes: Anytime it screws up, tell it to add a rule to its own MD file reminding it not to ever repeat that mistake again. Over time the MD file gets large, but the error rate plummets.
4) This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing. I usually add a rule to the MD file telling it not to touch them after I'm happy with them, but even then you must also check that the agent didn't change them the first time it hit a bug. Modern LLMs are now worse at this for some reason. Probably because they're getting smart enough to cheat.
If you do these basic things you'll get good results almost every time (one concrete way to enforce rule 2 is sketched below).
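For illustration, a minimal sketch of one way to make rule 2 mechanical rather than honor-system. This assumes a git + pytest setup, which the comment doesn't specify: a pre-commit hook that refuses the commit unless the test suite actually passes, so untested code literally cannot land.

    #!/usr/bin/env python3
    # Sketch of a .git/hooks/pre-commit hook (assumed tooling, not the
    # commenter's exact workflow): run the test suite and block the commit
    # if anything fails.
    import subprocess
    import sys

    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)
        print("Tests failed; commit blocked. Fix the code (or the tests) first.")
        sys.exit(1)
    sys.exit(0)

Whether the agent respects the hook or tries to edit it away is, of course, exactly the rule-4 problem.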
They remind me so much of that group of people who insist the scammy magnetic bracelets[1] "balance their molecules" or something making them more efficient/balanced/productive/energetic/whatever. They are also impossible to argue with, because "I feel more X" is damn near impossible to disprove.
[1] https://en.wikipedia.org/wiki/Power_Balance , https://en.wikipedia.org/wiki/Hologram_bracelet , https://en.wikipedia.org/wiki/Ionized_jewelry
It's impossible to prove in either direction. AI benchmarks suck.
Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.
Many of them are also burning through absurd token budgets - like running 10 Claudes at once and leaving them running continuously to "brute force" solutions out. It may be possible but it's not really an acceptable workflow for serious development.
The fabled 10x engineer existed long before, and independently of, agentic coding. Some people claim it's real, others claim it's not, with much the same conviction. If something that should be so clear cut is debatable, why would anyone now be able to produce a convincing, discussion-resolving argument for (or against) agentic coding? We don't even manage to do that for tabs vs. spaces.
The reason neither can be resolved in a forum like this is that coding output is hard to reason about for various reasons, and people want it to be hard to reason about.
I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.
Ah, the "then you are doing it wrong" defence.
Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is allegedly improving very fast.
> The burden of proof is 100% on anyone claiming the productivity gains
IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.
Recently I started using Claude Code cli on their latest opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that Claude Code cli with access to run the tests, run the apps, edit files, etc has made me pretty excited.
And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.
I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.
(all in my humble opinion)
I mean, a DSL packed full of features, a full LSP, DAP for step debugging, profiling, etc.
https://github.com/williamcotton/webpipe
https://github.com/williamcotton/webpipe-lsp
https://github.com/williamcotton/webpipe-js
Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!
Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!
One of us, one of us, one of us…
People claiming productivity gains don't have to prove anything to anyone. A few are trying to open others' eyes, but my guess is that will eventually stop. They will be among the few still left doing this SWE work in the near future :)
Responses are always to check your prompts, and ensure you are using frontier models - along with a warning about how you will quickly be made redundant if you don't lift your game.
AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.
Some fuel for the fire: over the last two months mine has become way better, one-shotting tasks frequently. I do spend a lot of time in planning mode to flesh out proper plans. I don't know what others are doing that they are so sceptical, but from my perspective, once I figured it out, it really is a massive productivity boost with minimal quality issues. I work on a brownfield project with about 1M LoC, fairly messy, mostly C# (so strong typing & a strict compiler are a massive boon).
My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.
To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different, I don't have some secret sauce but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.
> My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed.
I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.
> To the ones saying it is not working well for them, why don't you show and tell?
Sure, here you go:
As a die-hard old schooler, I agree. I wasn't particularly impressed by Copilot, though it did show a few interesting tricks.
Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could do myself if I had the time "on the side" and used them in "production". These were mostly personal tools, but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4000-line program, which I wrote piece by piece over several weeks, into something with proper packages and structures. There were one or two hiccups but I have it working. Took a day and approximately $25.
I have basically the same workflow. Planning mode has been the game changer for me. One thing I always wonder is how do people work in parallel? Do you work in different modules? Or maybe you split it between frontend and backend? Would love to hear your experience.
> why don't you show and tell?
How do you suggest? At a high level, the biggest problem is the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.
This.
If you’re not treating these tools like rockstar junior developers, then you’re “holding it wrong”.
- This has been going on for well over a year now.
- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).
These two points together make me think: why do they care so much to convince me; why don't they just link me to the amazing thing they made, that would be pretty convincing?!
Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair, they don't often look like vanilla LLM output, but they do all have the same structure/pattern to them.
I think it's a mix of people being actually hyped and wishing this is the future. For me, productivity gains are mostly in areas where I don't have expertise (but the downside, of course, is that I don't learn much if I let AI do the work) or when I know it's a throwaway thing and I absolutely don't care about the quality. For example, I'm bedtime reading a series of books for my daughter, and one of them doesn't have a Polish translation, and the Polish publisher stopped working with the author. I vibe coded an app that will extract an epub, translate each of the chapters, and package it back into an epub, with a few features like: saving the translations in sqlite so the translation can be stopped and resumed, the ability to edit translations, add custom instructions, etc. It's only ~1000 lines of Rust code, but Claude generated it while I was making dinner (I just checked progress and prompted next steps every few minutes). I can guarantee that it would have taken me at least an evening of coding, probably debugging problems along the way, to make it work. So while I know it still falls short in certain scenarios (novel code in niche technology, very big projects, etc.), it is kind of a game changer in others. It lets me build small tools that I just wouldn't have time to do otherwise.
So I guess what I'm saying is, even with all the limitations, I kinda understand the hype. That said, I think some people may indeed exaggerate LLMs' capabilities, unless they actually know some secret recipe to make them do all those awesome hyped things (but then I would love to see that).
Hilariously the only impressive thing I've ever heard of made in AI was Yegge's "GasTown" which is a Kubernetes like orchestrator... for AI agents. And half of it seemed to be a workaround for "the agents keep stopping so I need another agent to monitor another agent to monitor another agent to keep them on-task".
> why do they care so much to convince me;
Someone might share something for a specific audience which doesn't include you. Not everything shared is required to be persuasive. Take it or leave it.
> why don't they just link me to the amazing thing they made, that would be pretty convincing?!
99.99% of the things I've created professionally don't belong to me and I have no desire or incentives to create or deal with owning open source projects on my own time. Honestly, most things I've done with AI aren't amazing either, it's usually boring routine tasking, they're just done more cost efficiently.
If you flip the script, it's just as damning. "Hey, here's some general approaches that are working well for me, check it out" has been countered by the AI skeptics for years now with "you're lying and I won't even try it and you're also a bot or a paid shill". Look at basically every AI-related post and there's almost always someone ready to call BS within the first few minutes of it being posted.
Actually, quite the opposite. It seems any positive comment about AI coding gets at least one response along the lines of "Oh yeah, show me proof" or "Where is the deluge of vibe-coded apps?"
For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve to "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)
Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/
That seems like a lot more code than a tool like that should require.
Please provide links to the studies, I am genuinely curious. I have been looking for data but most studies I find showing an uplift are just looking at LOC or PRs, which of course is nonsense.
Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that after accounting for buggy code that needed to be re-worked there may be no productivity uplift.
I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.
more code = better software
They are not the same thing. If something works for me, I can rule out "it doesn't work at all". However, if something doesn't work for me I can't really draw any conclusions about it in general.
> if something doesn't work for me I can't really draw any conclusions about it in general.
You can. The conclusion would be that it doesn’t always work.
> anecdotally based on their own subjective experience
So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.
When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.
Anecdotal: (of an account) not necessarily true or reliable, because based on personal accounts rather than facts or research.
If you say you drove a 3 minute lap but you didn't time it, that's an anecdote (and is what I mean). If you measured it, that would be a fact.
Studies have shown that software engineers are very bad at judging their own productivity. When a software engineer feels more productive, the inverse is just as likely to be true. That's why anecdotal data can't be trusted.
The term "anecdotal evidence" is used as a criticism of evidence that is not gathered in a scientific manner. The criticism does not imply that a single sample (a car making a lap in 3 minutes) cannot be used as valid evidence of a claim (the car is capable of making a lap in 3 minutes).
I have never once seen extraordinary claims of AI wins accompanied by code and prompts.
I will prefix this all by saying I'm not in a professional programming position, but I would consider myself an advanced amateur, and I do code for work some. (General IT stuff)
I think the core problem is a lot of people view AI incorrectly and thus can't use it efficiently. Everyone wants AI to be a Jr or Sr programmer, but I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer. I don't think AI will ever be a programmer, but rather a tool to help programmers take the tedium away. I have seen massive speedups in my own workflow removing the tedium.
I have found prompting AI to be of minimal use, but tab-completion definitely speeds stuff up for me. If I'm about to create some for loop, AI will usually have a pretty good scaffold for me to use. If I need to handle an error, I start typing and AI will autocomplete the error handling. When I write my function documentation I am usually able to just tab-complete it all.
Yes, I usually have to go back and fix some things, and I will often skip various completion hints, but the scaffold is there, and as I start fixing faulty code it generated AI will usually pick up on the fixes and help me tab-complete the fixes themselves. If AI isn't giving me any useful tab-completions, I'll just start coding what I need, and AI picks up after a few lines and I can tab-complete again.
Occasionally I will give a small prompt such as "Please write me a loop that does X", or "Please write a setter function that validates the input", but I'll still treat that as a scaffold and go back and fix things, but I always give it pretty simple tasks and treat it simply as a scaffold generator.
I still run into the same problem solving issues I had before AI, (how do I tackle X problem?) and there isn't nearly as much speedup there, (Although now instead of talking to a rubber duck, I can chat with AI to help figure things out) but once I settle on the solution and start implementing it, I get that AI tab completion boost again.
With all that being said, I do also see massive boosts with fairly basic tasks that can be templated off something that already exists, such as creating unit tests or scaffolding a class, although I do need to go back and tweak things.
In summary, yes, I probably do see a 10x speedup, but it's really a 10x speedup in my typing speed more than a 10x speedup in solving the core issues that make programming challenging and fun.
> I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer
If you find a job as an enterprise software developer, you'd see that your core requirement doesn't hold :)
Productivity gains in programming have always been incredibly hard to prove, esp. on an individual level. We've had these discussions a million times long before AI. Every time a manager tries to reward some kind of metric for "good" code, it turns out that it doesn't work that way. Every time Rust is mentioned, every C fan finds a million reasons why the improvement doesn't actually have anything to do with using Rust.
AI/LLM discussions are the exact same. How would a person ever measure their own performance? The moment you implement the same feature twice, you're already reusing learnings from the first run.
So, the only thing left is anecdotal evidence. It makes sense that on both sides people might be a little peeved or incredulous about the others claims. It doesn't help that both sides (though mostly AI fans) have very rabid supporters that will just make up shit (like AGI, or the water usage).
Imho, the biggest part missing from these anecdotes is exactly what you're using, what you're doing, and what baseline you're comparing it to. For example, using Claude Code in a typical, modern, decently well architected Spring app to add a bunch of straight forward CRUD operations for a new entity works absolutely flawlessly, compared to a junior or even medior(medium?) dev.
Copy pasting code into an online chat for a novel problem, in an untyped, rare language, with only basic instructions and no way for the chat to run it, will basically never work.
The author is not claiming that ai agents don't make him more productive.
"I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders."
The people having a good experience with it want the people who aren't to share how they are using it, so they can tell them how they are doing it wrong.
Honestly though, idc about coding with it; I rarely get to leave Excel for my work anyway. The fact that I can OCR anything in about a minute is a game changer though.
Claims based on personal experience working on real world problems are likelier to be true.
It’s reasonable to accept that AI tools work well for some people and not for others.
There are many ways to integrate these tools and their capabilities vary wildly depending on the kind of task and project.
What I enjoy the most is that every "AI will replace engineers" article is written by an employee at an AI company, with testimonials from other people also working at AI companies.
Now that the "our new/next model is so good that it's sentient and dangerous" AGI hype has died down, the new hype goalpost is "our new/next model is so good it will replace your employees and do their jobs for you".
Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".
There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.
>> when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached
That is just plain narcissism. People seeking attention in the slipstream of megatrends make claims that have very little substance. When they are confronted with rational argument, they can't respond intellectually, so they try to dominate the discussion by demanding an overwhelming burden of proof, while their own position remains underwhelming.
LinkedIn and Medium are densely concentrated with this sort of content. It’s all for the likes.
Subjective experience is heavily influenced by expectations and desires, so they should try to verify.
I think it's a complex discussion because there's a whole bundle of new capabilities, the largest one arguably being that you can build a conversational interface to any piece of software. There's tons of pressure to express this in terms of productivity, financial and business benefits, but like with a coding agent, the main win for me is reduction of cognitive load, not an obvious "now the work gets done 50% faster so corporate can cut half the dev team."
I can talk through a possible code change with it which is just a natural, easy and human way to work, our brains evolved to talk and figure things out in a conversation. The jury is out on how much this actually speeds things up or translates into a cost savings. But it reduces cognitive load.
We're still stuck in a mindset where we pretend knowledge workers are factory workers and they can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so a LLM can turn the other half of the day into something more useful maybe?
There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).
My extremely casual observations on whatever research I've seen talked about has suggested that maybe with high quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.
This is not always the case, but I get the impression that many of them are paid shills, astroturf accounts, bots, etc. Including on HN. Big AI is running on an absurd amount of capital and they're definitely using that capital to keep the hype cycle going as long as possible while they figure out how to turn a profit (or find an exit, if you're cynical - which I am).
That’s a bit of a reductive view.
For example, even the people with the most negative view on AI don’t let candidates use AI during interviews.
You can disagree on the effectiveness of the tools but this fact alone suggests that they are quite useful, no?
At this point it's foolish to assume otherwise. This also applies to places like Reddit and X; there are intelligence services and companies with armies of bot accounts. Modern LLMs make it so easy to create content that looks real enough. Manufacturing consent is very easy now.
Public discourse on this is a dumpster fire. But you're not making a meaningful contribution.
It is the equivalent of saying: stenotype enthusiasts claim they're productive, but when we give them to a large group of typists we get data disproving that.
Which should immediately highlight the issue.
As long as these discussions aren't prefaced with the metric and methodology, any discussion on this is just meaningless online flame wars / vibe checks.
Last time I ran into this it was a difference of how the person used the AI, they weren't even using the agents, they were complaining that the AI didn't do everything in one shot in the browser. You have to figure out how people are using the models, because everyone was using AI in browser in the beginning, and a lot of people are still using it that way. Those of us praising the agents are using things like Claude Code. There is a night and day difference in how you use it.
There are different types of contrary claims though, which may be an issue here.
One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.
Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.
There's a lot of the second type on HN.
Yeah but there's also a lot of "lol, my shipped production code doesn't care" type comments with zero info about the type of code you're talking about, the scale, and longer term effects on quality, maintainability, and lack of expertise that using agentic tools can have.
That's also far from helpful or particularly meaningful.
> One thing I find really funny is when AI enthusiasts make claims about agents and their own productivity its always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claims regarding the capabilities of AI workflows. So which is it?
Really? It's little more than "I am right and you are wrong."
Everything you need to know about AI productivity is shown in this first chart here:
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This is why I can't wait for the costs of LLMs to shoot up. Nothing tells you more about how people really feel about AI assistants than how much they are willing to pay for them. These AI tools are useful, but I would not pay much more than what they are priced at today.
It's because the thing is overhyped and too many people are vested in keeping the hype going. Facing reality at this point, while necessary, is tough. The amount of ads for scam degrees from reputable unis about "Chief AI Officer" bullshit positions is staggering. There's just too much AI bubbling.
On one hand "this is my experience, if you're trying to tell me otherwise I need extraordinary proof" is rampant on all sides.
On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.
Someone who swears they have seen ghosts is obviously gonna have a problem with people saying ghosts don't exist. Doesn't mean ghosts exist.
One group is keen on rushing to destroy society for a quality-of-life improvement that they can't even be bothered to measure.
But there is still a hugely important asymmetry: If the tool turns your office into gods of software, they should be able to prove it with godly results by now.
If I tell you AmbrosiaLLM doesn't turn me into a programming god... well, current results are already consistent with that, so it's not clear what else I could easily provide.
TBH a lot of this is subjective. Including productivity.
My other gripe too is productivity is only one aspect of software engineering. You also need to look at tech debt introduced and other aspects of quality.
Productivity also takes many forms so it's not super easy to quantify.
Finally... software engineers are far from being created equal. VERY big difference between what someone doing CRUD apps for a small web dev shop does vs., e.g., an infra engineer in big tech.
It's really a high-level bikeshed. Obviously we are all still using and experimenting with LLMs. However, there is a huge gap in experiences and total usefulness depending on the exact task.
The majority of HNers still reach for LLMs pretty regularly, even if they frequently fail horribly. That's really the pit the tech is stuck in. Sometimes it one-shots your answer perfectly, or pair programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55.
Latest reasoning models don't claim 2 + 2 = 55, and it's hard to find them making any sort of obviously false claims, or not admitting to being mistaken if you point out that they are.
If someone seems to have productivity gains when using an AI, it is hard to come up with an alternate explanation for why they did.
If someone sees no productivity gains when using an AI (or a productivity decrease), it is easy to come up with ways it might have happened that weren't related to the AI.
This is an inherent imbalance in the claims, even if both people have brought 100% proof of their specific claims.
A single instance of something doing X is proof of the claim that something can do X, but no amount of instances of something not doing X is proof of the claim that something cannot do X. (Note, this is different from people claiming that something always does X, as one counter example is enough to disprove that.)
Same issue in math with the difference between proving a conjecture is sometimes true and proving it is never true. Only one of these can be proven by examples (and only a single example is needed). The other can't be proven even by millions of examples.
I don't get it? Yes you should require a valid reason before believing something
The only objective measures I've seen people attempt to take have at best shown no productivity loss:
https://substack.com/home/post/p-172538377
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This matches my own experience using agents, although I'm actually secretly optimistic about learning to use it well
Which it is, is clear - the enthusiasts have spent countless hours learning/configuring/adjusting, figuring out limitations, guarding against issues, etc., and now do 50 to 100 PRs per week like Boris.
Others … need to roll up their sleeves and catch up.
There isn't anything clear until someone manages to publish measurable and reproducible results for these tools while working on real world use cases.
Until then it's just people pulling the lever on a black box.
Merely counting PRs is not very impressive to me. My pre-LLM average is around 50/week anyway. But I'm not going to claim that somehow makes me the best programmer ever. I'm sure someone with 1 super valuable PR can easily create more value than I do.
Or the tool makers could just make better tools. I'm in that camp, I say make the tool adapt to me. Computers are here to help humans, not the reverse.
People working in languages/libraries/codebases where LLMs aren't good is a thing. That doesn't mean they aren't good tools, or that those things won't be conquered by AI in short order.
I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.
A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.
The microslop thing is largely just a backlash at MS jamming AI into every possible crevice of every program and service they offer, with no real plan or goals other than "do more AI".
They are not worse - the results are not repeatable. The problem is much worse.
Like with cab hailing, shopping, social media ads, food delivery, etc: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
A key difference is that the cost to execute a cab ride largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
>But inference costs are dropping dramatically over time,
Please prove this statement; so far there is no indication that this is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable).
There is a reason the AI companies don't ever talk about their inference costs. They boast with everything they can find, but inference... not.
[0]: https://www.wheresyoured.at/oai_docs/
What if we run out of GPU? Out of RAM? Out of electricity?
AWS is already raising GPU prices; that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
Your point could have made sense but the amount of inference per request is also going up faster than the costs are going down.
> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see margins shrink a la commodity, like you've implied.
> Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
If you run these models at home it's easy to see how this is totally untrue.
You can build a pretty competent machine that will run Kimi or Deepseek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple years, and it's cheaper than most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that these big model providers are operating on economies of scale, they're able to parallelize the GPUs and pack in requests much more efficiently.
> run these models at home
Damn, what kind of home do you live in, a data center? Teasing aside, maybe a slightly better benchmark is what sufficiently acceptable model (which is not objective, but one can rely on arguable benchmarks) you can run on infrastructure that is NOT subsidized. That might include cloud providers, e.g. OVH, or "neo" clouds, e.g. HF, but honestly that's tricky to evaluate as they tend to all have pure players (OpenAI, Anthropic, etc) or owners (Microsoft, NVIDIA, etc) as investors.
Ignores the cost of model training, R&D, managing the data centers and more. OpenAI etc regularly admit that all their products lose money. Not to mention the fact that it isn't enough to cover their costs, they have to pay back all those investors while actually generating a profit at some point in the future.
Uhm, you actually just proved their point if you run the numbers.
For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.
In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh
The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.
You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and less features.
And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.
Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
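For anyone who wants to poke at those parameters, here is the same arithmetic as a small script. The $20k hardware price, 2 kW draw, $0.20/kWh rate, and 5-10 tokens/sec figures are the assumptions stated above, not measurements of mine:

    # Back-of-the-envelope cost model from the comment above: a $20k home rig
    # drawing 2 kW at full utilization, $0.20/kWh, 5-10 tokens/sec, over 3 years.
    HOURS_PER_YEAR = 24 * 365
    YEARS = 3

    hardware_cost = 20_000                        # USD, assumed system price
    energy_kwh = 2 * HOURS_PER_YEAR * YEARS       # 2 kW continuous draw
    energy_cost = energy_kwh * 0.20               # ~$10.5k at $0.20/kWh
    total_cost = hardware_cost + energy_cost      # ~$30.5k over 3 years

    for tok_per_sec in (5, 10):
        total_tokens = tok_per_sec * 3600 * HOURS_PER_YEAR * YEARS
        cost_per_mtok = total_cost / (total_tokens / 1e6)
        print(f"{tok_per_sec} tok/s -> {total_tokens / 1e9:.2f}B tokens, "
              f"${cost_per_mtok:.0f}/Mtok")

Which reproduces the roughly $30-60/Mtok figure, give or take rounding.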
> Amortize that over a couple years, and it's cheaper than most people spend on a car payment.
I'm not parsing that: do you mean that the monthly cost of running your own 24x7 is less than the monthly cost of a car payment?
Whether true or false, I don't get how that is relevant to proving either that the current LLMs are not subsidised, or proving that they are.
I'm not sure. I asked one about a potential bug in iOS 26 yesterday and it told me that iOS 26 does not exist and that I must have meant iOS 16. iOS 26 was announced last June and has been live since September. Of course, I responded that the current iOS version is 26 and got the obligatory meme of "Of course, you are right! ramble ramble ramble...."
Sure. You have to be mindful of the training cutoff date for the model. By default models won't search the web and will rely on data baked into their internal model. That said, the ergonomics of this are horrible and a huge time waste. If I run into this situation I just say "Search the web".
Was this a GPT model? OpenAI seems to have developed an almost-acknowledged inability to usefully pre-train a model after mid-2024. The recent GPT versions are impassively lacking in newer knowledge.
The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.
(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)
Let's imagine a scenario. For your entire life, you have been taught to respond to people in a very specific way. Someone will ask you a question via email and you must respond with two or three paragraphs of useful information. Sometimes when the person asks you a question, they give you books that you can use, sometimes they don't.
Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?
The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by my Fedora version "since Fedora 42 is long deprecated".
You are better off talking to Google's AI mode about that sort of thing because it runs searches. Does great talking about how the Bills are doing because that's a good example where timely results are essential.
I haven't found any LLM where I totally trust what it tells me about Arknights, like there is no LLM that seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese Wiki for that game which I could crawl and store in a Jetbrains project and ask Junie questions about but I can't resolve the URL.
Which one? Claude (and to some extent, Codex) are the only ones which actually work when it comes to code. Also, they need context (like docs, skills, etc) to be effective. For example: https://github.com/johnrogers/claude-swift-engineering
Yep. The goal is to build huge amounts of hype and demand, get their hooks into everyone, and once they've killed off any competition and built up the walls then they crank up the price.
The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R+D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.
I've been explaining that to people for a bit now, along with a strong caution about how people are pricing these tools. It's all going to go up once dependency is established.
The AWS price increase on 1/5 for GPUs on EC2 was a good example.
AWS in general is a good example. It used to be much more affordable and better than boutique hosting. Now AWS costs can easily spiral out of control. Somehow I can run a site for $20 on Digital Ocean, but with AWS it always ends up $120.
RDS is a particular racket that will cost you hundreds of dollars for a rock-bottom tier. Again, Digital Ocean has tiers below $20 per month that will serve many a small business. And yet, AWS is the default go-to at this point because the lock-in is real.
>I hope everyone realizes that the current LLMs are subsidized
This is why I'm using it now as much as possible to build as much as possible in the hopes of earning enough to afford the later costs :D
The pricing will go down once the hardware prices go down. Historically hardware prices always go down.
Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.
I would imagine it's possible that, if the aforementioned future ever comes to pass, there will be new forms of ultra-high-tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day be running locally on desktops and/or handhelds, with the former being more likely.
Are hardware prices going down when each new generation improves less and less?
Hopefully we'll get some real focus on making LLMs work amazingly well with limited hardware.. the knock on effect of that would be amazing when the hardware eventually drops in price.
> I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
A.I. == Artificially Inexpensive
We're building a house on sand. Eventually the whole damn thing is going to come crashing down.
It would mean that inference is not profitable. Calculating inference costs shows it's profitable, or close to it.
Inference costs have in fact been crashing, going from astronomical to... lower.
That said, I am not sure that this indicator alone tells the whole story, if not hides it - sort of like EBITDA.
> I hope everyone realizes that the current LLMs are subsidized
Hell ya, get in and get out before the real pricing comes in.
"I'm telling ya kid, the value of nostalgia can only go up! This is your chance to get in on the ground-floor so you can tell people about how things used to be so much better..."
Wait for the ads
On the bright side, I do think at some point after the bubble pops, we’ll have high quality open source models that you can run locally. Most other tech company business plans follow the enshittification cycle [1], but the interchangeability of LLMs makes it hard to imagine they can be monopolized in the same way.
1: I mean this in the strict sense of Cory Doctorow’s theory (https://en.wikipedia.org/wiki/Enshittification?wprov=sfti1#H...)
Except most of those services don't have at-home equivalents that you can increasingly run on your own hardware.
I run models with Claude Code (Using the Anthropic API feature of llama.cpp) on my own hardware and it works every bit as well as Claude worked literally 12 months ago.
If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
They just need to figure out the KV cache that turned into a magic black box; after that it'll be fine.
The results are repeatable. Models are performing with predictable error rates on the tasks these models have been trained and tested on.
AI is built to be non-deterministic. Variation is built into each response. If it wasn't I would expect AI to have died out years ago.
The pricing and quality of Copilot and Codex (which I'm experienced in) feel like they're getting worse, but I suspect it may be that my expectations are getting higher as the technology matures...
The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?
I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.
The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.
This seems like a kind of odd test.
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
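For readers who haven't clicked through: the article doesn't reproduce the exact code, but the setup it describes is roughly the following (the file name here is my placeholder; only the 'index_value' column name comes from the article), where the only honest "fix" is to point out that the column doesn't exist in the data:

    import pandas as pd

    # Rough reconstruction of the test described in the article (names assumed):
    # load a dataframe, then reference a column that isn't in the data.
    df = pd.read_csv("data.csv")      # file contains no 'index_value' column
    total = df["index_value"].sum()   # raises KeyError: 'index_value'
    print(total)

The "fix" the article attributes to GPT-5 amounts to inventing the column from the index (e.g. df["index_value"] = df.index, my guess at the shape of such a patch), which runs but fabricates data the author never had.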
Trying to follow invalid/impossible prompts by producing an invalid/impossible result and pretending it's all good is a regression. I would expect a confident coder to point out that the prompt/instruction was invalid. This test is valid; it highlights sycophantism.
I know “sycophantism” is a term of art in AI, and I’m sure it has diverged a bit from the English definition, but I still thought it had to do with flattering the user?
In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for helpful misalignment.
I don't think this is odd at all. This situation will arise literally hundreds of times when coding some project. You absolutely want the agent - or any dev, whether real or AI - to recognize these situations and let you know when interfaces or data formats aren't what you expect them to be. You don't want them to just silently make something up without explaining somewhere that there's an issue with the file they are trying to parse.
I agree that I’d want the bot to tell me that it couldn’t solve the problem. However, if I explicitly ask it to provide a solution without commentary, I wouldn’t expect it to do the right thing when the only real solution is to provide commentary indicating that the code is unfixable.
Like if the prompt was “don’t fix any bugs and just delete code at random” we wouldn’t take points off for adhering to the prompt and producing broken code, right?
IOW not a competent developer because they can't push back, not unlike a lot of incompetent devs.
I suspect 99% of coding agents would be able to say "hey wait, there's no 'index_value' column, here's the correct input."
The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning and even bad code wouldn't try an 'index_value' column.
My thought process, if someone handed me this code and asked me to fix it, would be that they probably didn't expect df['index_value'] to hold df.index.
Just because, well, how'd the code get into this state? 'index_value' must have been a column that held something; having it just be equal to df.index seems unlikely because, as you mention, that's always been available. I should probably check the change history to figure out when 'index_value' was removed. Or ask the person what that column meant, but we can't do that if we want to obey the prompt.
The model (and you) have inferred, completely without context, that index_value is meant to somehow map to the dataframe index. What if this is raw .csv data from another system? I work with .csv files from financial indices - index_value (or sometimes index_level) carries a completely different meaning in that case.
The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based upon old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1 Codex, etc), and based upon that, even the Opus data is likely an older version.
This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".
You might as well ignore all of the articles and pronouncements and stick to your own lived experience.
The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.
The newer models DO let you know when something is impossible or unlikely to solve your problem.
Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.
I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".
I like AI for software development.
Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.
AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.
"Reality has a surprising amount of detail" or something along those lines.
I find the hardest thing is explaining what you want to the LLM. Even when you think you've done it well, you probably haven't. It's like a genie, take care with what you wish for.
I put great effort into maintaining a markdown file with my world model (usecases x principles x requirements x ...) pertaining to the project, with every guardrail tightened as much as possible, and every ambiguity and interaction with the user or wider world explained. This situates the project in all applicable contexts. That 15k token file goes into every prompt.
>I find the hardest thing is explaining what you want to the LLM.
Honestly this isn't that much different than explaining to human programmers. Quite often we assume the programmer is going to automatically figure out the ambiguous things, but commonly it leads to undefined behavior or bugs in the product.
Most of the stuff I do is as a support engineer working directly with the client on identifying bugs, needed features, and shortcomings in the application. After a few reports I made went terribly wrong when the feature came out, I've learned to be overly detailed and concise.
Do I read correctly that your md file is 15k tokens? How many words is that? That's a lot!
For the life of me, I don't get the productivity argument. At least from a worker perspective.
I mean, it's at best a very momentary thing. Expectations will adapt and the time gained will soon be filled with more work. The free-time net gain will ultimately be zero, optimistically, but I strongly suspect general life satisfaction will be much lower, since you inherently lose confidence in creation and agency, and the experience of self-efficacy is therefore lessened too. Even if external pressure isn't increased, the brain will adapt to what's considered a new normal for lazy. Everybody hates emptying the dishwasher; the aversion threshold is the same as washing dishes by hand.
And yeah, in the process you atrophy your problem solving skills and endurance of frustration. I think we will collectively learn how important some of these "inefficiencies" are for gaining knowledge and wisdom. It's reminiscent of Goodhart's Law, again, and again. "Output" is an insufficient metric to measure performance and value creation.
Costs for using AI services do not at all reflect the actual costs to sustainably run them. So these questionable "productivity gains" should be contrasted with actual costs, in any case. Compare AI to (cheap, plastic) 3D printing, which is genuinely transformative, revolutionary tech in almost every (real) industry; I don't see how the trillions of investment and the absurd energy and resource waste could ever be justified by what's offered, or even imaginable, for AI (considering its inherent limitations).
For me it boils down to that I'm much less tied to tech stacks I've previously worked on and can pick up unfamiliar ones quicker.
Democratization they call it.
That's a brilliant analogy, I had the same experience with Huel and AI Assistants
Why do I feel like I've just read a covert advertisement?
Sometimes I feel like the people here live on a different planet. I can't imagine what type of upbringing I would have to have to start thinking that "eating food" is an engineering problem to be solved.
This might be a controversial opinion, but I for one, like to eat food. In fact I even do it 3 times a day.
Don't y'all have a culture that's passed down to you through food? Family recipes? Isn't eating food a central aspect of socialization? Isn't socialization the reason people wanted to go to the office in the first place?
Maybe I'm biased. I love going out to eat, and I love cooking. But its more than that. I garden. I go to the farmers market. I go to food festivals.
Food is such an integral part of the human experience for me, that I can't imagine "cutting it out". And for what? So you can have more time to stare at the screen you already stare at all day? So you can look at 2% more lines of javascript?
When I first saw commercials for that product, I truly thought it was like a medical/therapeutic thing, for people that have trauma with food. I admit, the food equivalent of an i.v. drip does seem useful for people that legitimately can't eat.
I mean I don't think I'm giving a particularly favorable view of the product
I am used to seeing technical papers from ieee, but this is an opinion piece? I mean, there is some anecdata and one test case presented to a few different models but nothing more.
I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way
To be fair, it's very rare that articles praising the power of AI coding assistants are ever substantiated, either.
In the end, everyone is kind of just sharing their own experiences. You'll only know whether they work for you by trying it yourself.
> You'll only know whether they work for you by trying it yourself.
But at the same time, even this doesn't really work.
The lucky gambler thinks lottery tickets are a good investment. That does not mean they are.
I've found very very limited value from these things, but they work alright in those rather constrained circumstances.
And you can't try it out without, for the most part, feeding the training machine, at best for free.
3 replies →
This is the Spectrum magazine; the lighter fare. https://en.wikipedia.org/wiki/IEEE_Spectrum
Yeah I saw the ieee.org domain and was expecting a much more rigorous post.
This may be a situation where HackerNews' shorthand of omitting the subdomain is not good. spectrum.ieee.org appears to be more of a newsletter or editorial part of the website, but you wouldn't know that's what this was just based on the HN tag.
5 replies →
And the example given was specific to OpenAI models, yet the title is a blanket statement.
I agree with the author that GPT-5 models are much more fixated on solving exactly the problem given and not as good at taking a step back and thinking about the big picture. The author also needs to take a step back and realize other providers still do this just fine.
He tests several Claude versions as well
1 reply →
And they are using OpenAI models, who haven't had a successful training run since Ilya left; GPT-5.x is built on GPT-4.x, not from scratch, as I understand it.
I'm having a blast with gemini-3-flash and a custom Copilot-replacement extension; it's much more capable than Copilot ever was with any model for me, and gives a personalized DX with deep insights into my usage and what the agentic system is doing under the hood.
Can you talk a little more about your replacement extension? I get Copilot from my workplace and I'd love to know what I can do with it. I've been trying to build some containerized stuff with Copilot CLI, but I'm worried I have to give it more permissions than I'm comfortable with around git, etc.
A little off topic, but this seems like one of the better places to ask where I'm not gonna get a bunch of zealotry; a question for those of you who like using AI for software development, particularly using Claude Code or OpenCode.
I'll admit I'm a bit of a sceptic of AI but want to give it another shot over the weekend, what do people recommend these days?
I'm happy spending money but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is nearly $20 a prompt. Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.
If you want to try Opus you can get the lowest Claude plan for $20 for the month, which has enough tokens for most hobby projects. I've been using it to vibe code some little utilities for myself and haven't hit the limits yet.
Oh nice, I saw people on reddit say that Opus 4.5 will hit that $20 limit after 1-3 prompts, though maybe that's just on massive codebases. Like you, I'd just want to try it out on some hobby projects.
1 reply →
Give Codex a try for $20. You get a lot out of the base subscription. Opus will burn through the $20 sub in an hour.
The latest models are all really good at writing code. Which is better is just vibes and personal preference at this point IMO
The agent harness of claude code / opencode / codex is what really makes the difference these days
Oh nice, so Claude/OpenAI isn't as important as (Claude)Code/Codex/OpenCode these days? How is opencode in comparison, the idea of zen does seem quite nice (a lot of flexibility to experiment with different models), though it does seem like a bit more config and work upfront than CC or codex
1 reply →
Take some existing code and bundle it into a zip or tar file. Upload it to Gemini and ask it for critique. It's surprisingly insightful and may give you some ideas for improvement. Use one of the Gemini in-depth models like Thinking or Pro; just looking at the thinking process is interesting. Best of all, they're free for limited use.
Wanted to try more of what I guess would be the opposite approach (it writes the code and I critique), partially to give it a fair shake and partially just out of curiosity. Also I can't lie, I always have a soft spot for a good TUI which no doubt helps
This quote feels more relevant than ever:
> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.
Or in the context of AI:
> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.
Or in my context:
> Give a person code, and you help them for a day. Teach them to code, and you frustrate them for a lifetime.
I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and the forums are gone, and when there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?
That idea is called model collapse https://en.wikipedia.org/wiki/Model_collapse
Some studies have shown that direct feedback loops do cause collapse but many researchers argue that it’s not a risk with real world data scales.
In fact, a lot of advancements in the open weight model space recently have been due to training on synthetic data. At least 33% of the data used to train nvidia’s recent nemotron 3 nano model was synthetic. They use it as a way to get high quality agent capabilities without doing tons of manual work.
That's not quite the same thing, I think; the risk here is that the sources of training information vanish as well, not necessarily the feedback-loop aspect.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
1 reply →
The Habsburgs thought it wouldn't be a problem either
Can't help but wonder if that's a strategy that works until it doesn't.
Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM would come up with new projects, or feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintainability, and it could run e2e project simulations to make sure it actually works.
AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
You don't need synthetic data; people are posting vibe-coded projects on GitHub every day and they are being added to the next model's training set. I expect that in like 4-5 years, humans will just not be able to do things that are not in the training set. Anything novel or fun will be locked down to creative agencies and the few holdouts who managed to survive.
Or it'll create an alternative reality where that AI iterates itself into delusion.
That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next cycle of training will probably use the old data plus the new AI slop, and the final result will degrade accordingly.
Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.
Hallucinations generally don't matter at scale. Unless you're feeding back 100% synthetic data into your training loop it's just noise like everything else.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
7 replies →
I guess there'll be less collaboration and less sharing with the outside world; people will still collaborate and share, but within smaller circles. It'll bring an end to the era of the sharing-is-caring internet, as it doesn't benefit anyone but a few big players.
I bet they'll only train on the internet snapshot from now, before LLMs.
Additional non-internet training material will probably be human created, or curated at least.
This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online being flat wrong (it's definitely not).
Nope. Pretraining runs have been moving forward with internet snapshots that include plenty of LLM content.
1 reply →
Does it matter? Hypothetically if these pre-training datasets disappeared, you can distill from the smartest current model, or have them write textbooks.
If LLMs happened 15 years ago, I guess that we wouldn’t have had the JS framework churn we had.
While the author’s (banker and a data scientist) experience is clearly valuable, it is unclear whether it alone is sufficient to support the broader claims made. Engineering conclusions typically benefit from data beyond individual observation.
They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
So basically “you’re holding it wrong?”
Every time this is what I'm told. The difference between learning how to Google properly and then the amount of hoops and in-depth understanding you need to get something useful out of these supposedly revolutionary tools is absurd. I am pretty tired of people trying to convince me that AI, and very specifically generative AI, is the great thing they say it is.
It is also a red flag to see anyone refer to these tools as intelligence, as it seems the marketing of calling this "AI" has finally woven its way into our discourse to the point that even tech forums think the prediction machine is intelligent.
9 replies →
I’d say “skill issue” since this is a domain where there are actually plenty of ways to “hold it wrong” and lots of ink spilled on how to hold it better, and your phrasing connotes dismissal of user despair which is not my intent.
(I’m dismissive of calling the tool broken though.)
Remember when "Googling" was a skill?
LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.
2 replies →
Do you think it's impossible to ever hold a tool incorrectly, or use a tool in a way that's suboptimal?
8 replies →
I found this a pretty apt - if terse - reply. I'd appreciate someone explaining why it deserves being downvoted?
6 replies →
Needing the right scaffolding is the problem.
Today I asked 3 versions of Gemini "what were sales in December" with access to a SQL model of sales data.
All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except 2.5 Flash, which sometimes gave me sales for Dec 2023).
No sane human would hear “sales from December” and sum up every December. But it got numbers that an uncritical eye would miss being wrong.
That's the type of logical error these models produce that bothers the author. They can be very poor at analysis in real-world situations because they do these things.
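To make the failure concrete, here is a minimal pandas sketch of the difference. The column names and figures are invented for illustration, not taken from the actual sales model:

```python
import pandas as pd

# Hypothetical sales data; column names and values are assumptions for illustration.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-05", "2024-12-10", "2024-12-20"]),
    "amount": [100.0, 250.0, 75.0],
})

# What the models effectively did: sum every December across all years.
every_december = sales.loc[sales["date"].dt.month == 12, "amount"].sum()

# What "sales in December" almost certainly means: one specific (e.g. the most recent) year.
december_2024 = sales.loc[
    (sales["date"].dt.year == 2024) & (sales["date"].dt.month == 12), "amount"
].sum()

print(every_december)  # 425.0 -- plausible-looking but wrong for the question asked
print(december_2024)   # 325.0
```

The wrong number looks perfectly reasonable on its own, which is exactly why an uncritical eye misses it.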
"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."
Isn't this the same thing? I mean this has to work with like regular people right?
I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.
Make of that what you will…
I'm referring to these kind of articles as "Look Ma, I made the AI fail!"
Still, I would agree we need some of these articles when other parts of the internet are all "AI can do everything, sign up for my coding agent for $200/month".
Having to prime it with more context and more guardrails seems to imply they're getting worse. That's context and guardrails it can no longer infer or intuit.
No, they are not getting worse. Again, look at METR task times.
The peak capability is very obviously, and objectively, increasing.
The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)
Why the downvotes? This comment makes sense. If you need to write more guardrails, that does increase the work, and at some point the amount of guardrails needed to make these things work in every case would just be impractical. I personally don't want my codebase to be filled with babysitting instructions for code agents.
[dead]
I speculate LLM providers are serving smaller models dynamically to handle usage spikes and the need for compute to train new models. I have observed that agents become worse over time, especially before a new model is released.
Internally everyone is compute constrained. No one will convince me that the models getting dumb, or especially them getting lazy, isn't because the servers are currently being inundated.
However, right now it looks like we will move to training-specific hardware and inference-specific hardware, which hopefully relieves some of that tension.
Probably a big factor. The biggest challenge AI companies have now is value vs cost vs revenue. There will be a big correction, with many smaller parties collapsing or being subsumed as investor money dries up.
I think it's more a problem of GPU capacity than costs. Training takes a lot of resources, inference too.
1 reply →
In general, "failing to run (successfully)" should per se be seen as a bad signal.
It might still be:
- the closest to a correct solution the model can produce
- be helpful for finding out what is wrong
- might be intended (e.g. in a typical very short red->green unit-test dev approach, you want to generate some code which doesn't run correctly _just yet_; see the sketch after this list). Tests for newly found bugs are supposed to fail (until the bug is fixed). Etc.
- if "making it run" means removing sanity checks, doing something semantically completely different, or similar, it's, like the OP author said, one of the worst outcomes
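A minimal red->green sketch of that point, with a hypothetical function and bug rather than anyone's actual code: the test is written first and is supposed to fail until the fix lands.

```python
def normalize(values):
    # Current (buggy) implementation: divides by zero on an all-zero input.
    total = sum(values)
    return [v / total for v in values]


def test_normalize_handles_all_zero_input():
    # Red today: this raises ZeroDivisionError.
    # Green once normalize() guards against a zero total.
    assert normalize([0, 0, 0]) == [0, 0, 0]
```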
The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
> Like the conversation history is introducing noise rather than helpful context.
From https://docs.github.com/en/copilot/concepts/prompting/prompt...:
Copilot Chat uses the chat history to get context about your request. To give Copilot only the relevant history:
- Use threads to start a new conversation for a new task
- Delete requests that are no longer relevant or that didn’t give you the desired result
Remember that the entire conversation is literally the query you’re making, so the longer it is the more you’re counting on the rational comprehension abilities of the AI to follow it and determine what is most relevant.
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.
> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate; I will accept the suggestion and then delete much of what it produced, especially in the critical path.
For many users, this is much faster than trying to get another approximation.
Same for `10dd`, etc.; it is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.
It would be a mistake to think that filtering out junior devs will result in good data, as the concept is flawed in general. Accepting output may not have anything to do with the correctness of the provided content, IMHO.
He asked the models to fix the problem without commentary and then… praised the models that returned commentary. GPT-5 did exactly what he asked. It doesn’t matter if it’s right or not. It’s the essence of garbage in and garbage out.
If they are supposed to replace actual devs we would expect them to behave like actual devs and push back against impossible requests.
Except it's not an impossible request. If my manager told me "fix this code with no questions asked" I would produce a similar result. If you want it to push back, you can just ask it to do that or at least not forbid it to. Unless you really want a model that doesn't follow instructions?
I've felt this. Bit scary given how essential of a tool it has become.
I started programming before modern LLMs so I can still hack it without, it will just take a lot longer.
Not sure I agree with his tests, but I agree with the headline. I recently had Cursor launch into seemingly endless loops of grepping and `cd`-ing and `ls`-ing files. This was in multiple new convos. I think they're trying to do too much, for too many "vibe coders", and the lighter-weight versions that did less were easier to steer to meet your architecture and needs.
I stopped using them. Occasionally I go back to see if it's better but really I just treat them as a more interactive stackoverflow/google.
I've been stung by them too many times.
The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.
> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right.
So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D
We should be able to pin to a version of training data history like we can pin to software package versions. Release new updates w/ SemVer and let the people decide if it’s worth upgrading to
I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users
If you talk to people who deal with inference using large fungible datasets, this is an extremely difficult governance problem. semver is incredibly insufficient and you don't have a well defined meaning of what "upgrade" even means let alone "major", "minor", and "patch".
It's a major disservice to the problem to act like it's new and solved or even solvable using code revision language.
I think the models are so big that they can’t keep many old versions around because they would take away from the available GPUs they use to serve the latest models, and thereby reduce overall throughput. So they phase out older models over time. However, the major providers usually provide a time snapshot for each model, and keep the latest 2-3 available.
If you're an API customer, you can pin to a specific dated snapshot of the model.
See the "Snapshots" section on these pages for GPT-4o and 4.1, for example:
https://platform.openai.com/docs/models/gpt-4o https://platform.openai.com/docs/models/gpt-4.1
This is done so that application developers whose systems depend upon specific model snapshots don't have to worry about unexpected changes in behaviour.
You can access these snapshots through OpenRouter too, I believe.
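For example, with the OpenAI Python SDK you can target a dated snapshot instead of the floating alias. The model string below is one of the published GPT-4o snapshots from the pages linked above; treat the exact name as something to verify against the current model list:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    # Dated snapshot: behaviour stays fixed, unlike the floating "gpt-4o" alias.
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Summarize this changelog entry."}],
)
print(response.choices[0].message.content)
```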
Every model update would be a breaking change, an honest application of SemVer has no place in AI model versions.
Not saying using major.minor depending on architecture is a bad thing, but it wouldn’t be SemVer, and that doesn’t even cover all the different fine tuning / flavors that are done off those models, which generally have no way to order them.
There's figurative and literal, though. Figurative SemVer (this is a system-prompt update vs a model retrain) would actually work OK... or at least build numbers.
I think you could actually pretty cleanly map semver onto more structured prompt systems ala modern agent harnesses.
that's not enough, the tool definitions change, the agent harness changes, you need to pin a lot of stuff
This is a sweeping generalization based on a single "test" of three lines that is in no way representative.
Are `sweeping generalizations` even possible to make representative? If not, then where do we draw the line?
[flagged]
The issue is NOT particular to the GPT models. Gemini does this stuff to me all of the time as well! Bandaids around actual problems, hides debugging, etc. They're just becoming less usable.
A dataset with only data from before 2024 will soon be worth billions.
2022. When chatgpt first came out. https://arstechnica.com/ai/2025/06/why-one-man-is-archiving-...
I’ve already gotten into the habit of sticking “before:2022” in YT if what I’m looking for doesn’t need to be recent.
The AI slop/astroturfing of YT is near complete.
1 reply →
https://en.wikipedia.org/wiki/Low-background_steel
Synthetic data is already being embraced. Turns out you actually can create good training data with these models.
The failure mode of returning code that only appears to work correctly is one I've encountered before. I've had Sonnet (4, I think) generate a bunch of functions that check whether parameter values are out of the valid range and just return without error when they should trigger a failing assertion. That kind of thing does smell of training data that hasn't been checked for correctness by experienced coders.
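A minimal sketch of that failure mode (the function and the range are assumptions for illustration, not the actual generated code): the first version silently swallows bad input, the second fails loudly.

```python
def apply_discount_generated(price: float, rate: float) -> float:
    # The "looks fine" version: silently ignores an out-of-range rate,
    # so bad input produces a plausible number instead of an error.
    if not 0.0 <= rate <= 1.0:
        return price
    return price * (1.0 - rate)


def apply_discount_strict(price: float, rate: float) -> float:
    # The version an experienced reviewer would want: fail loudly.
    assert 0.0 <= rate <= 1.0, f"discount rate out of range: {rate}"
    return price * (1.0 - rate)
```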
Edit: Changed 3.5 to 4.
Edit: Looking back to edits and checkins by AI agents, it strikes me that the checkins should contain the prompt used and model version. More recent Aider versions do add the model.
Not seeing this in my day to day, in fact the opposite.
Can you be more specific? E.g. refute something specific that the article mentions. Or are you only reacting to the title, not the article's contents?
I think it should be on the article to prove its title. I hardly think presenting one test case to some different models substantiates the claim that "AI Coding Assistants Are Getting Worse." Note that I have no idea if the title is true or not, but it certainly doesn't follow from the content of the article alone.
3 replies →
I think, as the article mentions, it's garbage in, garbage out: we are more trusting and expect more. Coding assistants don't just need a good model, they need a good harness, and those harnesses have also changed recently.
The article is ridiculous garbage. I knew the IEEE had fallen into irrelevance, but that their magazine now prints nonsense like this -- basically someone's ad wrapped in an incredibly lazy supposition -- is incredibly damning.
The guy wrote code depending upon an external data file (one that the LLM didn't have access to), with code that referred to a non-existing column. They then specifically prompted it to provide "completed code only, without commentary". This is idiotic.
"Dear LLM, make a function that finds if a number is prime in linear time. Completed code only! No commentary!".
Guy wanted to advertise his business and its adoption of AI, and wrote some foolish pablum to do so. How is this doing numbers here?
1 reply →
Couldn't agree more.
I would expect older models to make you feel this way.
* Agents not trying to do the impossible (or not being an "over eager people pleaser" as it has been described) has significantly improved over the past few months. No wonder the older models fail.
* "Garbage in, garbage out" - yes, exactly ;)
He graded GPT 4 as winning because it didn't follow his instructions. And the instructions are unrealistic to anyone using coding assistants.
Maybe it's true that for some very bad prompts, old version did a better job by not following the prompt, and that this is reduced utility for some people.
Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.
Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.
For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.
I find the whole idea of AI coding assistants strange.
For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.
> For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better
Similar to moving from individual work to coordinating a large codebase: coding agents, human or otherwise, let you think at a higher abstraction level and tackle larger problems by taking care of the small details.
If I’m coordinating a large codebase, I expect the people I’m coordinating to be capable of learning and improving over time. Coding agents cannot (currently) do this.
I wonder if a very lightweight RL loop built around the user could work well enough to help the situation. As I understand it, current LLMs generally do not learn at a rate such that one single bad RL example and one (prompted?) better example could result in improvement at anywhere near human speed.
I primarily find them useful in augmenting my thinking. Grokking new parts of a codebase, discussing tradeoffs back and forth, self-critiques, catching issues with my plan, etc.
I noticed Claude Code (on a $100 Max subscription) has become slower for me in the last few weeks. Just yesterday it spent hours coding a simple feature which I could have coded myself faster.
The article uses pandas as a demo example for LLM failures, but for some reason even the latest LLMs are bad at data science code, which is extremely counterintuitive. Opus 4.5 can write an EDA backbone, but it's often too verbose for code that's intended for a Jupyter Notebook.
The issues have been less egregious than hallucinating an "index_value" column, though, so I'm suspect. Opus 4.5 still has been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.
This is not my experience. Claude Code has been fine for data science for a while. It has many issues and someone at the wheel who knows what they're doing is very much required, but for many common cases I'm not writing code by hand anymore, especially when the code would have been throwaway anyway. I'd be extremely surprised if a frontier model doesn't immediately get the problem the author is pointing out.
I only have experience with using it within my small scope, being full-stack NodeJS web development (i.e. an area with many solved problems and millions of lines of existing code for the models to reference), but my experience with the new Opus model in Claude Code has been phenomenal.
And the Ads aren't even baked in yet . . . that's the end goal of every company
Ads, dogfooding and ideology
There's really not much to take from this post without a repo and a lot of supporting data.
I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?
Interesting if true, but I would presume it to be negligible in comparison to the magnitude of gains over "manual coding", right? So nothing to lose sleep over at the moment...
Is it possible to re-run it? I am curious for Gemini 3 Pro.
As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.
But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.
It feels like, lately, Google's AI search summaries are getting worse - they have a kernel of truth, but combine it with an incorrect answer.
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.
I think if you keep the human in the loop this would go much better.
I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
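For illustration, a minimal sketch of what such a tool definition might look like in a standard function-calling schema. The names and descriptions are my assumptions, not the commenter's actual implementation:

```python
# Hypothetical "AskHuman" tool definition in OpenAI-style function-calling JSON.
ask_human_tool = {
    "type": "function",
    "function": {
        "name": "ask_human",
        "description": "Ask the human a blocking question instead of guessing.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The question itself.",
                },
                "how_it_unblocks": {
                    "type": "string",
                    "description": "How the answer unblocks progress on the current task.",
                },
            },
            "required": ["question", "how_it_unblocks"],
        },
    },
}
```

Requiring the second field forces the model to justify the interruption, which is presumably what keeps the dialog structured rather than a free-for-all.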
Codex is still useful for me. But I don't want to pay $200/month for it.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.
Codex is included with the $20 a month ChatGPT subscription, with very generous limits.
It feels like the more standardized the organization, or the more academic the background of an author, the more their insights lag behind the tip of the arrow.
It's clear AI coding assistants are able to help software developers at least in some ways.
Having a non-software-developer perspective speak about it is one thing, but one should be mindful that there are experienced folks too, for whom the technology appears to be a jetpack.
If it didn't work for you, that just means there's more to learn.
> It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.
So much this... the number of times Claude sneaks in default values, or avoids `.unwrap`ping optional values, just to avoid a crash at all costs... it's nauseating.
I have been noticing this myself for the last couple of months. I cannot get the agent to stop masking failures (ex: swallowing exceptions) and to fail loudly.
That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.
I keep finding myself saying “stop over complicating things” over and over again, because even the simplest questions about how to load a file sometimes gets a code response that’s the size of a framework.
When coding assistants take longer, it's because they use more tokens, and that's because AI companies are obligated to make more money.
I can imagine Claude getting worse. I consider myself bearish on AI in general and have long been a hater of "agentic" coding, but I'm really liking using aider with the deepseek API on my huge monorepo.
Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.
Likely, and I'm being blithe here, it's because of great acceptance. If we try it on more difficult code, it'll fail in more difficult ways?
Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.
The problem is everyone is using a different “level” of AI model. Experiences by those who can only afford or choose not to pay for the advanced reasoning are far worse than those who can and do pay.
ChatGPT is getting worse and is a useless model. Surprised that people are still using it. The article tests only this model.
This guy is using AI in the wrong way...
Strange that the article talks about ChatGPT 4 and 5 but not the latest 5.2 model.
Or any models NOT from OpenAI
I'm not sure it is really getting worse, but I have had AI assistants add todo()s and comments saying that this still needs to be implemented and then tell me they did what I asked them to do.
I think this is what the Ralph Wiggum plugin is for. It just repeatedly reprompts the LLM with the same prompt until the task is fully complete, or something along those lines.
Betteridge's law of headlines is an adage that states: "Any headline that ends in a question mark can be answered by the word no."
> This is of course an impossible task—the problem is the missing data, not the code.
We cannot assert that with certainty. If the datum is expected to be missing, such that a frame without the datum is still considered valid and must be handled rather than flagged as an error, the code has to do exactly that. Perhaps a missing value in the dictionary can be substituted with a zero.
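A minimal pandas sketch of that distinction, reusing the article's "index_value" column name; the default-to-zero branch is only an assumption about what a documented fallback might look like:

```python
import pandas as pd


def get_index_values(df: pd.DataFrame, missing_ok: bool = False) -> pd.Series:
    # Decide explicitly whether a missing column is an error or has a defined default.
    if "index_value" not in df.columns:
        if not missing_ok:
            raise KeyError(
                "Column 'index_value' is missing from the input data; "
                "the fix belongs upstream, not in this code."
            )
        # Only valid if zero is a documented, agreed-upon default.
        return pd.Series(0.0, index=df.index)
    return df["index_value"]
```

Either policy can be correct; what the article objects to is the model silently picking one without being told which applies.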
The author suspects that this effect is due to users accepting these "make it work" fixes. But wouldn't training for coding challenges also explain this? Because those are designed to be solvable, anything that lets you move forward toward the solution is better than giving up.
Silent but deadly... oooohh, scary! Jesus, talk about sensationalizing a boring topic.
The key point in the middle of the article. As AIs expand usage to larger numbers of lower-skilled coders whose lower ability to catch errors and provide feedback generates lower quality training data, the AIs are basically eating their own garbage, and the inevitable GIGO syndrome starts.
>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
From what I understand, model collapse/GIGO is not a problem, in that labs generally know where the data comes from, so even if it causes problems in the long run you could filter it out. It's not like labs are forced to train models on user outputs.
Indeed they are not forced to train them on user outputs, but the author of the article seems to have found good evidence that they are actually doing that, and will need more expert data-tagging/filtering on the inputs to regain their previous performance
2 replies →
While I still prefer to code my side project in Python and Flask myself, I recently used Cursor to write unit tests. It took a few hours of tweaking, refining, and fixing tests, but afterwards I had over 400 unit tests with 99% coverage of my app and routes. I would never have spent the time to get this amount of test coverage manually.
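For anyone curious what that looks like in practice, here's a minimal sketch of the kind of Flask route test involved; `create_app` and the `/health` route are hypothetical stand-ins, not the commenter's actual project:

```python
import pytest

from myapp import create_app  # hypothetical application factory


@pytest.fixture
def client():
    app = create_app()
    app.config["TESTING"] = True
    return app.test_client()


def test_health_route_returns_ok(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.get_json() == {"status": "ok"}
```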
I do find there are particular days where I seem to consistently get poor results, but in general this is not my experience. I’m very pleased with the output 80% of days.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).
Guess in which cost category "high-quality data reviewed by experts" falls under.
I would hope the trillions of dollars sloshing around are used to pay people to make the core of the product better.
If you ask around the Magnificent 7, a lot of the talk rhymes with "we're converting opex into capex", translated: "we're getting rid of people to invest in data centers (to hopefully be able to get rid of even more people over time)".
There are tons of articles online about this, here's one:
https://finance.yahoo.com/news/amazon-bets-ai-spending-capex...
They're all doing it, Microsoft, Google, Oracle, xAI, etc. Those nuclear power plants they want to build, that's precisely to power all the extra data centers.
If anything, everyone hopes to outsource data validation (the modern equivalent to bricklayers under debt slavery).
Where are the benchmarks for all the different tools and subscriptions/APIs?
CLI vs IDE vs web?
Nothing for GPT Codex 5.1 Max or 5.2 Max?
Nothing about the prompts? The quality of the prompts? I literally feed the AI into the AI: I just ask a smaller model for the most advanced prompts and then use them for the big stuff, and it's smooth sailing.
I got Codex 5.1 Max, with the Codex extension in VS Code, to generate over 10k lines of code for my website demo project, and it worked the first time.
This is also with just the regular $20 subscription.
GitHub Copilot Pro+ plus VS Code is my main go-to, and the project, the prompts, the agent.md quality, and the project configuration can all change the outcome of each question.
Perhaps because nobody is on Stack Overflow providing updates?
Yep. Not just stack overflow -- pretty much everywhere. If only someone could have foreseen this problem!!!
Anyway, no issue. We'll just get Claude to start answering Stack Overflow questions!
This definitely matches my experience.
Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug (50 classes, 20k LOC total, so well within context limits). I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.
I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.
What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.
At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.
The following was originally at the start of your comment:
> Here’s the same text with all em dashes removed and the flow adjusted accordingly:
Did you have an LLM write your comment then remove the evidence?
I cleaned it up with an LLM. Is there a problem with that?
Sorry, I should be clear: do you have a problem with that?
1 reply →
Is it just me or is this a giant red flag?
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.
This is more common than you think
Tons of smart people not using it right
Unsure of the power it can actually unleash with the right prompt + configuration
100% needs a human in the loop
It's not Jarvis
idk but opus is pretty good
I’m sorry but what a ridiculous assertion. They are objectively better on every measure we can come up with. I used 2b input and 10m output tokens on codex last week alone. Things are improving by the month!
>However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.
This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all; one of those two Sonnet versions was known to change tests so that they would pass, even if they no longer tested things properly.
>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
No proof or anything is offered here.
The article feels mostly like a mix of speculation, and being behind on practices. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insist that they are easy to review and hard to fake, offering examples. This worked well 6 months ago, this works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
so you're saying all those bros on linkedin telling me that "this is the worst it's ever going to be" were full of shit? i am shocked.
Counterpoint: no, they're not. The test in the article is very silly.
This springs to mind:
"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"
It's valid to argue that there's a problem with training models to comply to an extent where they will refuse to speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.
There is an actual problem here, though, even if part of the problem is competing expectations of refusal.
But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.
I'd guess (I haven't tested) that you'd have decent odds of getting better results even just pasting the error message into an agent than adding stupid restrictions. And even better if you actually had a test case that verified valid output.
(and on a more general note, my experience is exactly the opposite of the writer's two first paragraphs)
How is it silly?
I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".
A similar type of reward hacking is pretty commonly observed in other types of AI.
It's silly because the author asked the models to do something they themselves acknowledged isn't possible:
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
But the problem with their expectation is that this is arguably not what they asked for.
So refusal would be failure. I tend to agree refusal would be better. But a lot of users get pissed off at refusals, and so the training tends to discourage that (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).
And asking for "complete" code without providing a test case showing what they expect such code to do does not have to mean code that runs to completion without error, but again, in lots of other cases users expect exactly that, and so for that as well a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.
I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user that asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.
So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.
It is silly because the problem isn't becoming worse, and not caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...) and they are working to reduce the problem, and measure it better. The assertions in the article seem to be mostly false and/or based on speculation, but it's impossible to really tell since the author doesn't offer a lot of detail (for example for the 10h task that used to take 5h and now takes 7-8h) except for a very simple test (that reminds me more of "count the r in strawberry" than coding performance tbh).
Is it?
This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn't actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don't really think it found in its training set, as to why it wasn't wrong.
I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.
I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.
This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
They all do this at some point. Claude loves to delete tests that are failing if it can't fix them, or delete code that won't compile if it can't figure it out.
1 reply →
The strength of argument you're making reminds me of an onion headline.
https://theonion.com/this-war-will-destabilize-the-entire-mi...
"This War Will Destabilize The Entire Mideast Region And Set Off A Global Shockwave Of Anti-Americanism vs. No It Won’t"
I was thinking of that when I wrote it.
Yes. He's asking it to do something impossible then grading the responses - which must always be wrong - according to his own made-up metric. Somehow a program to help him debug it is a good answer despite him specifying that he wanted it to fix the error. So that's ignoring his instructions just as much as the answer that simply tells him what's wrong, but the "worst" answer actually followed his instructions and wrote completed code to fix the error.
I think he has two contradictory expectations of LLMs:
1) Take his instructions literally, no matter how ridiculous they are.
2) Be helpful and second guess his intentions.
It's the following that is problematic: "I asked each of them to fix the error, specifying that I wanted completed code only, without commentary."
GPT-5 has been trained to adhere to instructions more strictly than GPT-4. If it is given nonsensical or contradictory instructions, it is a known issue that it will produce unreliable results.
A more realistic scenario would have been for him to have requested a plan or proposal as to how the model might fix the problem.
[dead]
[dead]
[dead]
Forgot to mention. I made catsbook in 3 days and presentation earlier in 7 days.
I do think AI code assistants are super great.
Recently, I mostly use Open Codex 5.2 + the extra-high reasoning model with the $200 monthly subscription, and it's the best among all the other coding agents.
(I have subscribed to 4 at the same time and use all of them across a dozen projects at the same time.)
[dead]
[dead]
[dead]
[flagged]
I mean, it's 2026, you can just say things I guess.
Good point, it’s 2026, they could have just said “Things are getting worse.”
This is a wildly out of touch thing to say
Did you read the article?
I read it. I agree this is out of touch. Not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered to find this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
8 replies →