Comment by hollowturtle

6 days ago

> The coding agents got really good

It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

All I observe they got better at tool call and answering questions about big codebases, especially if the question has a vague pattern to search, and they're superuseful for that! For generating production code even with a lot of steering and baby sitting?

Absolutely not, not quite there not even close in my experience.

But we should stop talking about 1s and 0s, especially with marketing hype trains, there exist a gradient of capabalities that agents have that really depends on the intricacies of the codebase you're working on, I think everyone has yet to discover how to better apply these tools in their day to day work.

But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!

That's why the debate is so polizered imo, there isn't a shared experience

The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...

And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.

  • I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.

    Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.

    Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...

    • Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.

      The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.

      36 replies →

    • That is the same fight the 2D animators were having with 3D aninmation 30 years ago. The resolution is likely to be the same: the tool wins but the fundamentals stay, and the line between competent and incompetent practitioners moves but does not disappear.

    •   > I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me.
      

      Honestly, I think this is where the big divide is. People have massively different opinions on what "quality" is. Which is okay, but it feels like everyone is working under some assumption that quality is this very clear objective measure that we all agree on. Clearly we don't. We didn't before AI and well... if you can't tell that we don't with AI... you need to take a step back.

      FWIW, I agree with Philip here. I don't think this screams "high quality" to me. I'm also not trying to take a shit on your project. Nothing screams "terrible" to me, but yeah, it does look a bit sloppy. There's no polish to it. It looks like someone that grades on "it works" and that's fine. But it also isn't everyone's cup of tea. Where the sloppiness comes in is like what Philip said. First thing I saw was the gif and well... I think Claude Code is sloppy. But this is also a great example at how and where LLMs visibly fail. Creating a box in text is pretty simple. There's tons of tools to do it. And the LLM 100% knows about characters like ⌜⌝⌞⌟⎜, it just doesn't use them and doesn't care. The code itself also looks very LLM generated.

      It's fine and I don't think you have any reason to be ashamed of it, but I also wouldn't go around boasting that it is an example of high quality work too. And FWIW, I can't think of a single heavily LLM assisted code where I don't have similar feelings. I've seen stuff with more polish, but yeah, they feel off.

        > TUI
      

      This is a space I feel weird in. I love the terminal. I love that there's a lot of new TUIs. But it also feels very weird because it is extremely clear that a lot of these new TUIs were written by people (or machines) that don't really have a lot of experience in the terminal itself. There's a real shared language by people like me who live in the cli. There's a reason people like me can pick up a new tool and guess certain flags and certain ways to use them. It's because of a shared design language that we know of and we end up writing that way because we know it reduces to cognitive load on our peers. But the LLMs? They don't have that shared experience.

      I think this is true for a lot of stuff, not just TUIs or bash tools. Things just smell... off...

    • I think at this point there is no convincing people. Clearly there is value in these tools and it generates code when steered properly. Perhaps your struggles are down to a skill issue.

  • While reading this thread, I literally just caught an agent putting in the following CSS selector in a rule:

    > .row > div > div, .alert

    This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.

    I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.

    • I haven't done any CSS/HTML/JS level work with Claude yet. I've mainly been using it for systems level stuff.

      LLMs have traditionally had problems with visual rendering (the good ol' pelican on the bicycle test). I wonder if this is more of the same?

      4 replies →

    • > I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention.

      Yeah, absolutely. People think you're picking on, like, code formatting and no, dawg, your code doesn't do what you think it does, or it only handles the happiest of happy paths.

      I do find it funny when people get mad about you critiquing their AI project. You didn't even write it, dude.

    • Or they don’t know CSS.

      Amazing how the LLM is godly with things I don’t understand, and falls over completely when it works in my domain… I wonder why that is /s

      5 replies →

  • Don't want to be rough, but I'd like to read experiences about novelty ideas that solve people real problems in the real world, your project it's just about selling new shovels.

    As I commented on another thread

    > If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

    • As a random example of a "hard" problem solved by AI that I couldn't have realistically done myself, despite having decades of wide industry experience:

      Reverse engineering a proprietary protocol from a binary executable.

      I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.

      My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)

      Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.

      There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!

      2 replies →

    • This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.

      3 replies →

    • The comment was directed at:

      > For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.

      As I said, this is an example of using AI successfully to produce a high quality product (one that I use every day).

      But to your point: I am solving hard problems that people really have. You just don't see those because I haven't mentioned them publicly yet. And they won't be released or talked about until they're ready.

    • Claude wrote me a little python script to help me sort and rank all the AI videos I've generated. It also extracted the metadata and organized it into a CSV. I sent it some hex dumps of the header and it got it first try. The header structure of webms generated by comfy are pretty novel.

  • > The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

    Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.

    It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.

    I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.

    • > The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people.

      I think it’s really down to this. Nobody can agree on what counts as production-quality code. I remember joining a company with what I think (hope) most of us would call horrible quality code. It was an absolute mess, barely compiled with hundreds of warnings, and had uncountable number of bugs. They didn’t even have a bug tracker so nobody even knew how many they had.

      But the people working there already were so proud of it! None of them had ever worked for another company so they had no idea how bad their code was in comparison with the rest of the software industry (which itself is a very low bar). I told the founder we had a huge code quality problem and he looked at me like I had horns growing out of my head.

      When someone says their LLM is producing “production-quality” code, actually look at it and see. Arguing about it on HN is pointless because everyone’s quality bar is different.

    • Absolutely! I find its test generation, properly steered, to be top notch. In many ways it's like having a second head, because it'll spontaneously come up with test paths that I'd normally only get to after a month or so in one of my "aha! What about XYZ?" shower thoughts.

      You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"

    • > The code I get from LLM's is usually much better than what I get from my peers

      Then you should seriously question for who you're working for imo.

      > It also isn't lazy.

      It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum I want to know a test that can "guarantee" me that a further change doesn't break existing code. And producing this high quality in tests is HARD, and requires a lot of steering with agents. This culture of tests code coverage is just wrong, the best code base I worked with had code coverage only on the net percent of code that matters, the rest is covered by for static type checking and integration tests

  • A standard Docker container, with the container UID/GID mirrored to the host user, holding the host user's API keys, with the host user's project directory bind-mounted. The tooling doesn't even use gVisor / Kata by default which could implement the claim made, but in reality this entire project appears to be security theater.

    • I’d like people to notice that those who claim this amazing AI productivity boost are always: pushing out software they don’t know how to judge the quality of and pushing projects that are 70% done. Every. Single. Time.

      I use Claude all the time, it is immensely helpful. It is also very nuanced and requires a high level of expertise in a specific domain to produce quality work. Even then, that take time and effort. Anyone saying otherwise, quite frankly, doesn’t know what they’re doing.

@hollowturtle I'm surprised - do you really find that sota models aren't good enough to generate production code with steering and babysitting? My experience (Claude Code, mostly Opus 4.6) is that it's fantastic at this. At least in JS + TS + Elixir + Ruby. It does indeed need babysitting, my mental model is that it's an exoskeleton not a junior dev, but IME it's a friggin badass exoskeleton, easily 10x-ing my speed on most work. Notably I do NOT --dangerously-skip-permissions nor use claude code's auto mode, I micromanage and lightly review every line it's writing as it writes it, so I rarely have more than 2 sessions generating simultaneously. I suspect that a lot of the disappointment comes in when people try to delegate to it and trust it to not go off the rails. It hasn't earned that trust from me yet (and hasn't needed to yet).

Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.

  • It really depends on the task, but, in my experience, small to medium and bigger codebases, the amount of steering to get quality code is not worth it.

    I see patterns and solutions emerging from hand coding, I'm not the other way around, I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimumn effort and context.

    Starting with a prompt, or in plan mode, it's not how I trained as an engineer, I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand, for example my muscle memory suggest me to use a specific data structure only after I see some code patterns emerging, hard to explain hopefully makes sense.

    If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc. it usually start with a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non web UIs). Let alone doing it via spec files, if it's something I don't care about yeah sure, maybe a little tool for entering/editing data, but alas it always default to slop web apps, and I get it I mean most of the training set is on web apps

    • > quality code

      Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.

    • I think this depends a lot on the task, the existing codebase, and the taste of the operator.

      In general I tend to agree with you if you're talking a codebase you are deeply familiar with, the value-add from have agents write the code probably ranges from very small to negative in most cases.

      On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.

      Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll tradeoff ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.

Coding goodness is just "unevenly distributed."

Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.

Also... I think our era has an intrinsic bias that change=progress, productivity, etc.

Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.

But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.

A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.

Maybe administration was never really a bottleneck.

Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.

Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.

I haven't heard of many belting out features, and increasing prices or sales.

Most bottlenecks are upstream of another bottleneck. Few are a "dam."

I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.

My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT), they're able to just pick up the project and work at all these levels:

- Go code that implements the transpiler (parsing Wasm, building an AST)

- Go code that gets generated by serializing the AST to a .go file

- Go code that manipulates the AST (to optimize it), and its effect on the generated code

- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST

- C code that gets compiled to Wasm, then translated to Go, then called by Go

- Go code that gets called by this C code to implement a C stdlib

- WAT and WAST files that are used to implement the Wasm spec tests

I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.

And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parenthesis" in the Go code (I do have some LISP experience; it's still easier).

Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.

https://github.com/ncruces/wasm2go

> But we should stop talking about 1s and 0s

I agree, but you contradicted yourself just one line above.

> For generating production code even with a lot of steering and baby sitting? Absolutely not

Moreover this is further in contradiction with several facts:

1. the majority of this industry has always been composed by mediocre/bad developers, often unable to write a fizz buzz

2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties

3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?

  • >> But we should stop talking about 1s and 0s

    > I agree, but you contradicted yourself just one line above.

    >> > For generating production code even with a lot of steering and baby sitting? Absolutely not

    with this last sentence I obviously meant in my experience, it's not that hard. I don't buy your facts are highly biased towards web development, that's a common mistake here on HN to think it's the totality of the industry, luckily it's not

    • I've quoted you two tools (Ghostty and Redis) whose development now regularly uses AI assistance to deliver production code. I quoted those because their authors shared their experiences, the strengths and the limits of the tooling.

      There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's a magnitude more where people don't admit or tag their PRs as AI generated or co-generated.

      In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask a cheap review to an AI tool.

      The industry is changing, I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality ones, (especially when used, setup and overviewed properly) is plain denial.

      Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.

      9 replies →

I had a really fun day yesterday because anthropics limits on their normal 20$ subscription allowed me to play around for the whole day without hitting a limit.

Its 'production' code because its a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.

The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.

This doesn't need to become so much better to be useful. Its already very useful for a lot ofuse cases like researchers which have to verify the math anyway but are not good in writing code for filtering their testdata, converting them and running it.

Small websites, fun projects, helper tools etc.

But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.

We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.

  • > The code it generated hat 0 compiletime errors

    And no spelling errors either!

    Also,

    > Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet

    >> embedding-shape 1 hour ago | root | parent | next [–]

    >>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.

    If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?

    • I'm not a native english speaker and when i mentioned that i might use LLM for fixing spellings, people argued about the use of LLM. So spelling error yes/no?

      I do not understand the quote you rference at all tbh?

  • I don’t see how “fun projects” and “take our jobs” fit together in any voluntary sentence.

    • Firstly i wrote examples but also etc. so its more than just that. It is also refactoring, cicd pipelines and co.

      2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.

      Now its one shoting a lot. Including websides, refactorings, etc.

      The question is what is missing? How far are we that it can handle huge code bases vs. smaller ones? How far are we that it can comprehend the whole architecture and doesn't try to put a service in a wrong place just becaus the context is too small?

      Mythos is 10 Trillion, that might be already pushing it.

      95% might be not enough for someone in sense of "yeah i can't do the 95% and i can't do the 5% either the AI can do 100% or i still need Kevin with his knowledge even if its just for the last 5%"

      1 reply →

  • "We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."

    This is nonsense. Im not a SWE but a CEO, if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.

    • I wrote coding job. And its true for coding jobs.

      Your Product Manager is not a coding job. Your Product Owner is not a coding job.

      vibe-kanban exists you could already do a proper experiment letting your PO maintain a vibe-kanban board with proper requirements and see how an agent progresses.

      But 5% is often enough wwhat breaks it. Doesn't help much when your PM, PO or CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.

      3 replies →

    • CEO makes fresh account to tell someone that writing code is not the entire job? I don’t buy it.

I'm convinced that the polarization is that one's impression of AI has a direct 1:1 mapping with one's previous level of skill and sensitivity to quality. Most people are by definition average and they are impressed.

Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathon Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.

Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.

I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.

  • This is contradicted by the amount of AI use at top tech firms.

    • No, it would be contradicted by the sentiment among devs at top tech firms, but I don't know what the sentiment is. I do know they are being forced to use AI at peril of termination such that their use of AI is a non signal.

My explanation for the lack of shared experience is very language dependent quality. I work in Go and it's gotten really really good. I have to pick the right abstraction and it can be overly verbose at times but it can make in 5 minutes what would have taken me an hour.

Steve Yegge wrote about this in his book Vibe Coding. He says it takes about a year of experience before you're consistently getting good results. He writes about lots of different techniques for doing that, but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire.

  • > but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire

    That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.

    • +1 to all of this. The challenge can be staying focused and thinking when the AI assistant is (1) moving very fast and (2) often times doing multiple things at the same time.

      I know I have struggled to keep up, and fall into the trap of approving things (either commands or recommendations) without taking the time to really process and think about them.

      It's a bit like the age old problem of "it's super easy to ask questions, and can be super hard to answer many of them". So the economy of the conversation gets out of whack fast.

  • It's been 4 years of using them for me, before writing a book I'd wait to have a decade of experience to share with others, otherwise it would have the same value as a book on a react tutorial

  • I'd say closer to 6 months for me but probably still some room to improve.

    I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.

    After having Claude Code "remember" my preferences and tools, it's more efficient.

    It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.

> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.

I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.

This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.

I believe by now we know exactly what it's good at and what it's terrible at.

The problem is that our CEO's fear of the future that pushes them to peculiar decisions that objectively make no sense (cf the infamous discussion of the Microsoft employee on Github that couldn't force its agent to do the proper thing).

It's not the first time I witness this kind of discrepancy and probably not the last, I just learned to adapt to it.

> Absolutely not, not quite there not even close in my experience.

Well... I don't know what you expect but so far I'd like all my colleagues to write code at the level of what I get from codex.

I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.

Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.

> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot

and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)

It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.

Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.

I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.

So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.

I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png

Grok is OK for general stuff, never tried it for coding.

Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)

Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol

  • > I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

    I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.

  • Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance, it pushes you to do better, maybe one piece at a time like you said with a good modularized codebase. I think more devs should share experiences like that, we should overthrow marketing and people narratives that "don't code anymore since X"

  • I set up a hook that reviews every commit and highlights potential bugs (async) and writes to a report to a dir.

    Then I have a script that summarises that I usually run before pushing or at end of day.

    Works quite well for both improving my code and the code ai wrote.

I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tried, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:

    assert_eq x, true if x == true

Both Claude and Codex, both with the latest versions and the original versions that had been working.

Now I just use deepseek. It isn't any dumber, and it costs way less.

I'm curious. What have you actually tried? Are you just prompting the LLM with one off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.

  • I'm curious. What makes you think that me sharing an example(which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?

    As I said we have a plenty of different envs, codebases, requirements. Things are complex.

    You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.

    Let me stress this out again:

    > That's why the debate is so polizered imo, there isn't a shared experience

    • In my experience most people with the type of critique I'm seeing from you have only tried it one time or have not taken the time to invest in an environment/process that will work for agentic coding.

      My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.

      Having said all that, you're right there isn't a shared experience.

      1 reply →

Good is relative. If somebody struggles with getting their hand-written code to compile, the LLM coding agents will look like geniuses to them.

An idiosyncrasy of humanity is that the dumbest individuals tend to also be the loudest.

Which languages and subject matter do you work with?

  • c/c++, java, kotlin, go, some perl scripting, some javascript. Gaming industry

    • We'll there's your problem.

      F1 mechanic pops the hood of a mass-market Toyota Corolla and doesn't understand why everyone says it's really good.

      A lot of us are out here building websites or phone apps.

      Not to say that these things can't also be taken very seriously from first-principles, but I think that's rare.

      2 replies →

Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.

I think it more reliably does IaC with established patterns especially when it can do a dry run.

Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho

Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.

I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.

you are experiencing reverse Dunning–Kruger effect.

For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.

now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaulated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.

  • Please do not cite Dunning–Kruger effect at random.

    Who needs to generate a dumb demo of a 97% done crud app? We had code generators for those, everytime I read claims like that and I ask to explain further I then discover it's people who were not productive before generating the so called "MVP level things to completion with ease".

    If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

    LLMs can effectively validate your business idea

    • I don't really see your point. Most problems that people have aren't really super-novel, but just extremely bespoke.

      To give a specific example, 12 months ago I had a client pay me me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.

      These days you'd just one-shot it in Claude.

      6 replies →

    • I'm beginning to get the sense that Sturgeon's Law is at play here and the non-crap 10% of us are arguing with the 90% for whom LLM's shitty output is actually better than what they could do on their own.

      I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems to be that's not the norm or the case everywhere.

      and it's the 90% that's most vocal. Sturgeon and D-K seen to go hand-in-hand.

      1 reply →

    • The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.

      If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??

      5 replies →

> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

The answer is "for lots of people, but not you".

You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of poralizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".

  • When I say

    > Absolutely not, not quite there not even close in my experience.

    I obviously mean in my experience, not the real truth.

    > That everyone else is being led by a "marketing hype

    That is obvious instead, and I later say there's not 0s or 1s, every job has his intrincancies