Comment by nickandbro
14 hours ago
I feel like we are just inching closer and closer to a world where rapid iteration of software will be the default. For example: a trusted user gives feedback -> the feedback gets curated into a ticket by an AI agent, then turned into a PR by an agent, then reviewed by an agent, before being deployed by an agent. We are maybe one or two steps from the flywheel being complete. Or maybe we are already there.
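The loop being described could be sketched roughly like this. This is a toy sketch with every stage stubbed out as a pure function; in the scenario above each stage would be an LLM agent call, and all names here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    summary: str

@dataclass
class PR:
    ticket: Ticket
    diff: str

def curate_ticket(feedback: str) -> Ticket:
    # Agent 1: turn raw user feedback into a ticket.
    return Ticket(summary=feedback.strip().capitalize())

def implement(ticket: Ticket) -> PR:
    # Agent 2: turn the ticket into a pull request.
    return PR(ticket=ticket, diff=f"fix: {ticket.summary}")

def review(pr: PR) -> bool:
    # Agent 3: review gate before deployment.
    return bool(pr.diff)

def flywheel(feedback: str) -> str:
    # The full feedback -> ticket -> PR -> review -> deploy loop.
    pr = implement(curate_ticket(feedback))
    return "deployed" if review(pr) else "rejected"
```

The catch, of course, is that in the real version each stage has a nonzero error rate, and those errors compound through the loop.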
I just don’t see it coming. I was fully in that camp 3 months ago, but I’ve since realized that every step introduces more mistakes. It leads to a deadlock once no human has the mental model anymore.
Don’t you have hard business problems that AI just can’t solve, or solves only very slowly, presenting you 17 ideas until it finds the right one? I’m using the most expensive models.
I think the nature of AI might block that progress, and I think some companies have woken up to this and others will wake up later.
The mistake rate is just too high. And every system you implement to reduce that rate has a mistake rate of its own, while adding complexity and exploration time.
I think the big bulk of people is now where the early adopters were in December: AI can implement working functionality in a well-maintained codebase.
But it can’t write maintainable code itself. It actually makes you slower compared to writing the code assisted, because assisted you are much more in the loop: you can stop a lot of small issues right away, and you iterate on everything quickly.
I hadn’t opened my IDE for a month, and at some point it became hell. I’ve now deleted 30k lines, and the number of issues I’m seeing has been an eye-opening experience.
Unscalable performance issues, verbosity, straight-up bugs, escape hatches around my verification layers, quadrupled types.
Now I could monitor the AI's output more closely, but then again I’m faster writing it myself, because that's a single task. AI-assisted typing isn’t slower than my brain is.
Also, thinking about it more: FAANG pays $300 per line in production, so what are we really trying to achieve here? Speed was never the issue. A great coder writes 10 production lines per day.
Accuracy, architecture, etc. are the issue. You get those by building solid fundamental blocks that make feature additions easier over time, not slower.
I know it’s not your main point, but I’m curious where $300/line comes from. I don’t think I’ve ever seen a dollar amount attached to a line of production code before.
I think this sounds like a true yet short-sighted take. Keep in mind these features are immature, but they exist to obtain a flywheel and corner the market. I don’t know why, but people seem to consistently miss two points and their implications:
- performance is continuing to increase incredibly quickly, even if you rightfully don’t trust any particular evaluation; scaling laws like Chinchilla and the RL scaling laws (both train-time and test-time) predict this
- coding is a verifiable domain
The second one is the most important. Agent quality is NOT limited by the human code in the training set; that code is simply used for efficiency: it gets you to a good starting point for RL.
Claiming that things will not reach superhuman performance, INCLUDING on all end-to-end tasks (understanding a vague, poorly articulated business objective, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc.), is what carries the burden of proof, because we can literally predict performance (albeit with a complicated relationship between benchmarks and real-world performance).
Yes, definitely: error rates are still too high for this to be totally trusted end to end, but they are improving consistently, and this is what explains the METR time-horizon results.
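For reference, the Chinchilla result mentioned above fits pretraining loss as a power law in parameter count $N$ and training tokens $D$; Hoffmann et al. reported fitted exponents of roughly $\alpha \approx 0.34$ and $\beta \approx 0.28$:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $E$ is the irreducible loss, and the two power-law terms shrink predictably as model and data scale, which is the sense in which performance is "predictable" even before a model is trained.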
Scaling laws vs. combinatorial explosion: who wins? In my personal experience, Claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written before (even if it is logically simple for a human). The question is whether this is a quantitative or a qualitative barrier.
Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.
8 replies →
> - coding is a verifiable domain
You're missing the point though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but that's missing the forest for the trees: how do you know which one is the right choice long-term, given what we know? That's the engineering part of building software, as opposed to the "coding", which tends to be the easy part.
How do you evaluate, score, and/or benchmark something like that? Currently I don't think we have any methodology for this, probably because it's pretty subjective in the end. That's where the "creative" part of software engineering becomes more important, and it's also much harder to verify.
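To make that concrete, here is a toy illustration (hypothetical code, not from the thread): two implementations that pass exactly the same tests, so a verification-based reward scores them identically, yet they differ in long-term maintenance cost.

```python
from collections import namedtuple

Item = namedtuple("Item", "price")

def total_v1(items):
    # Direct, idiomatic implementation.
    return sum(item.price for item in items)

class PriceAccumulator:
    """Equally correct, but more surface area to maintain."""
    def __init__(self):
        self._total = 0

    def add(self, price):
        self._total += price
        return self

    def value(self):
        return self._total

def total_v2(items):
    # Same observable behavior as total_v1, routed through extra machinery.
    acc = PriceAccumulator()
    for item in items:
        acc.add(item.price)
    return acc.value()
```

Any test suite that checks outputs will accept both, which is exactly why "coding is verifiable" doesn't by itself settle the engineering question.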
14 replies →
But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan a way of staying aligned with core business strategy, forethought, etc.
The number of devs will shrink, but there will still be large bodies of work that can't be farmed out without an overall strategy.
5 replies →
[dead]
I love everything about this direction except for the insane inference costs. I don’t mind the training costs, since models are commoditized as soon as they’re released. Although I do worry that if inference costs drop, the companies training the models will have no incentive to publish their weights because inference revenue is where they recuperate the training cost.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two month lead time, that will be huge progress, but still wasteful because then we’ve just shifted the problem to a hardware bottleneck. I expect we’ll see something akin to gameboy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing some more insanely radical ideas, like reverting back to analog computing and leveraging voltage differentials in clever ways. It’s too big brain for me, but intuitively it feels like wasting entropy to reduce a voltage spike to 0 or 1.
Inference costs at least seem like the thing that is easiest to bring down, and there's plenty of demand to drive innovation. There's a lot less uncertainty here than with architectural/capability scaling. To your point, tomorrow's commodity hardware will solve this for the demands of today at some point in the future (though we'll probably have even more inference demand then).
> I love everything about this direction except for the insane inference costs.
If this direction holds, the ROI is better.
Instead of employing 4 people (customer support, PM, eng, marketing), you will have 3-5 agents, and the whole ticket flow might cost you ~$20.
But I hope we won't go that far, because when things fail, every customer will be impacted, and there will be no one left who understands the system well enough to fix it.
I worry about the costs from an energy and environmental impact perspective. I love that AI tools make me more productive, but I don't like the side effects.
The environmental impact of AI is greatly overstated. The average person will make a bigger positive impact on the environment by reducing their meat intake by 25% than by giving up flying and AI use combined.
1 reply →
[dead]
This is the wrong way to see it. If a technology gets cheaper, people will use more and more and more of it. If inference costs drop, you can throw way more reasoning tokens and a combination of many many agents to increase accuracy or creativity and such.
> throw way more reasoning tokens and a combination of many many agents to increase accuracy or creativity and such.
But this just isn't true; otherwise companies that can already afford such high prices would have outpaced their competitors by now.
1 reply →
I mean, theoretically, if there are many competitors, the cost of the product should generally drop because of competition.
Sadly, I have not seen this happen in a long time.
I think that as a user I'm so far removed from the actual (human) creation of software that, if I think about it, I don't really care either way. Take for example this article on Hacker News: I am reading it in a custom app someone programmed, which pulls articles hosted on Hacker News, which themselves sit on some server somewhere, and everything gets transported across wires according to a specification. For me, this isn't some impressionist painting or heartbreaking poem; the entity that created those things is so far removed from me that it might as well be artificial already. And that's coming from a kid of the 90s with some knowledge of cyber security, so I could potentially look up the documentation and maybe even the source code for the things I mentioned, if I were interested.
Art is and has always been about the creator.
I don't want software that is built to be art. I want software that is built to provide facilities.
1 reply →
Take a walk in any museum, I'm pretty sure you'll react to some of the art displayed there and find it cool before you read the name of the artist.
3 replies →
We haven’t been inching closer to users writing a half-decent ticket in decades though.
Solutions like https://bugherd.com/ might make the issue context capture part more accurate.
Maybe the agent can ask the user clarifying questions. Even better if it could do it at the point of submission.
Feedback loops like that would be an exercise in raising garbage-in->garbage-out to exponential terms.
It's the "robots will just build/repair themselves" trope, but the robots are agents.
Yes. Next they'll want nanobots that build/repair themselves.
Oh wait. That's already here and is working fine.
Trusted user, like Jia Tan.
I think Anthropic will launch backend hosting off the back of their Bun acquisition very soon. It makes sense to basically run your entire business out of Claude, and share bespoke apps built by Claude Code for whatever your software needs are.
100% it's going to happen. OpenAI will do the same; there were already rumors about them building an internal "GitHub", which is a stepping stone toward that. It's also a requirement for completing the lock-in, the dream for these companies.
Ha, I just spec'd out a version of this. I have a simple static website that I want a few people to be able to update.
So, we will give these 3 or 4 trusted users access to an on-site chat interface to request updates.
Next, a dev environment is spun up, the agent makes the changes, creates a PR, and sends a branch preview link back to the user.
Sort of an agent driven CMS for non-technical stakeholders.
Let’s see if it works.
I think some types of tickets can be done like this, but your "trusted user" assumption does a lot of work here. I don't see this getting better than that with the current architecture of LLMs: you can add all sorts of feedback mechanisms, which help, but since LLMs are not conscious, drift is unavoidable unless there is a human in the loop who understands and steers what's going on.
But I do think that even now, certain types of CRUD apps can be largely automated. And that's a fairly large part of our profession.
Users are often incorrect about what the software should actually be doing and don’t see the bigger picture.
In the past three weeks, a couple of projects I follow have implemented AI tools with their own GitHub accounts, which have been doing exactly this. And they appear to be doing good work! Dozens of open issues iterated on, tested, and closed. At one point I had almost 50 notifications for one project's backlog being eradicated in 24 hours. The maintainer reviewed all of it, and some were not merged.
I don't know if this is the future, but if it is, why bother building one version of the software for everyone? We can have agents build the website for each user exactly the way they want. That would be the most exciting possibility to come out of AI-generated software.
"why bother building one version of the software for everyone?"
So that one user's experience is relevant to another's, and they can learn from one another?
What kind of software are people building where AI can just one shot tickets? Opus 4.6 and GPT 5.4 regularly fail when dealing with complicated issues for me.
GPT 5.4 sometimes straight up dies with broken API responses, let alone when it struggles with an even moderately complex task.
I still can't get a good mental model for when these things will work well and when they won't. Really does feel like gambling...
Not just complicated ones, but even simple ones, if the current software uses a pattern "new" enough that they've never seen it or trained on it.
I don't know if Rust async or native platform APIs that have existed for years count as new patterns, but if you throw even a small wrench in the works, they really struggle. That's to be expected, really, when you look at what the technology is: it's kind of insane we've even gotten to this point with what amounts to fancy autocomplete.
Of course, not all tickets are complex. Last week I had a ticket to display the update date on a blog post next to the publish date. A perfect use case for AI to one-shot.
I don't see anyone sane trusting AI to this degree any time soon, outside of web dev. The chances of this strategy failing are still well above acceptable margins for most software, and in safety-critical instances it will be decades before standards allow such adoption. Anyway, we are paying pennies on the dollar for compute at the moment: as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans, unless some more efficient generalizable architecture is identified.
> as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans. unless some more efficient generalizable architecture is identified.
All Chinese labs have to do to tank the US economy is to release open-weight models that can run on relatively cheap hardware before AI companies see returns.
Maybe that's why AI companies are looking to IPO so soon, gotta cash out and leave retail investors and retirement funds holding the bag.
4 replies →
Several fintechs like Block and Stripe are boasting thousands of AI-generated PRs with little to no human reviews.
Of course it's in the areas where it doesn't matter as much, like experiments, internal tooling, etc, but the CTOs will get greedy.
3 replies →
Even in webdev it rots your codebase unchecked. Although it's incredibly useful for generating UI components, which makes me a very happy webslopper indeed.
1 reply →
[dead]
I know a company already operating like this in the fintech space. I foresee a front page headline about their demise in their future.
The missing piece for me is post-hoc review.
A PR tells me what changed, but not how an AI coding session got there: which prompts changed direction, which files churned repeatedly, where context started bloating, what tools were used, and where the human intervened.
I ended up building a local replay/inspection tool for Claude Code / Cursor sessions mostly because I wanted something more reviewable than screenshots or raw logs.
I don't mean this as shade, but people who are not coders now seem to think "coding is solved" and are pushing absurd ideas like shipping software via Slack messages. These people are often high up the chain and have never done serious coding.
Stripe is apparently pushing a gazillion PRs from Slack now, but their feature velocity has not changed. So what gives?
How is it that the number of PRs is now the primary metric of productivity, and no one cares about what is being shipped or whether we are shipping product faster? It's total madness right now. Everyone has lost their collective minds.
I ask myself the same question.
I'm not seeing the apps, SaaS, and other tools I use getting better, with either more features or fewer bugs.
Whatever is being shipped, as an end user, I'm just not seeing it.
CTOs and CEOs are now feeling insane pressure to show how they are using AI, but it's not evident in the output. So now they've resorted to blathering publicly about PRs, lines of code, etc. to save face. And of course the people giving them a voice and a platform have their own agendas that prevent them from asking, "So what exactly has Stripe shipped from a million PRs a day?"
It's baffling to see these comments on Hacker News, though. I guess you have to prove you're not a Luddite by making "AI-forward" predictions and showing that you "get it".
I think a lot of SWE roles are really bullshit jobs (1), and those have been particularly susceptible to getting sniped by AI tools.
(1) https://en.wikipedia.org/wiki/Bullshit_Jobs
Or perhaps we end up in a world where all software is self-evolving via agents… adjusting dynamically to meet the user's needs.
The "user" being the one that's in charge of the AI, not the person on the receiving end.
Instead of having a trusted user, you can also do statistics on many users.
(That's basically what A/B testing is about.)
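The statistical gate in an A/B test is simple enough to sketch; here is a standard two-proportion z-test with pooled variance, using hypothetical conversion numbers:

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """Z-statistic for the difference between two conversion rates,
    using a pooled variance estimate."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant B converting 120/1000 vs. A's 100/1000 gives z ≈ 1.43,
# short of the conventional 1.96 threshold for 95% confidence.
z = two_proportion_z(100, 1000, 120, 1000)
```

Whether a change ships could then hinge on crossing a threshold like this rather than on any single trusted user's judgment.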
The "trusted user" can also be an agent.
What you're describing is absolutely where we're headed.
But the entire SWE apparatus can be handled too: automated A/B testing of the feature, progressive-exposure deployment of changes, you name it.
Haha sure, let's just let every user add their feedback to the software.
I think the AI agent will directly make a PR; tickets are for humans with limited mental capacity.
At least in my company we are close to that flywheel.
Tickets need to exist purely from a governance perspective.
Tickets may well not look like they do now, but some semblance of them will exist. I'm sure someone is building that right now.
No. It's not Jira.
Yes, and my point is that PRs act as that governance layer: with preview environments, you can see the complexity and risk of the change, etc.
The agents have even more limited capacity
At the moment, maybe. But it's growing.
2 replies →
I am already there with a project/startup with a friend. He writes up an issue in GitHub and there is a job that automatically triggers Claude to take a crack at it and throw up a PR. He can see the change in an ephemeral environment. He hasn't merged one yet, but it will get there one day for smaller items.
I am already at the point where because it is just the two of us, the limiting factor is his own needs, not my ability to ship features.
Must be nice working on simple stuff.
Why doesn’t he merge them?
He's not technical but a product guy, so he still wants me to check it over.
We do feedback to ticket automatically
We dont have product managers or technical ticket writers of any sort
But we devs are still choosing how to tackle each ticket. We don't strictly have to, since I'm solving the tickets with AI anyway; I could automate my job away if I wanted. But I wouldn't trust the result: I give a degree of input and steering, and there are bigger-picture considerations it's not good at juggling, for now.
Then it sets up telemetry and experiments with the change. Then, if the data looks good, an agent ramps it up to more users or removes it.
> I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default.
There's a lot of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1], or install a toll booth once there's less money sloshing about and the winners and losers are clear. At some point in the future, Atlassian and GitHub may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.
1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.
[dead]
Um, we are already there...