
Comment by pornel

2 months ago

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

  • I love that you’re getting straightforward replies to this absolutely sick burn. The blade is so sharp that some people aren’t even feeling it.

  • LLM code is higher quality than any code I have seen in my 20 years in F500. So yeah, you need to "guide" it, and ensure that it will not bypass all the security guidance, for example... But at least you are in control, although the cognitive load is much higher than just blindly trusting what is delivered.

    But I can see the carnage with offshoring+LLM, or "most employees", including so-called software engineers, + LLM.

    • Huh, that explains a lot about the F500, and their buzzword slogans like "culture of excellence".

      LLM code is still mostly absurdly bad, unless you tell it in painstaking detail what to do and what to avoid, and never ask it to do a bigger job at a time than a single function or a very small class.

      Edit: I'll admit though that the detailed explanation is often still much less work than typing everything yourself. But it is a showstopper for autonomous "agentic coding".

      22 replies →

    • Uhuh. Let me present you Rudolph. For the next 15 minutes, he will paste pieces of top-rated SO answers and top-starred GH repos. Then he will suffer complete amnesia. He might not understand your question or remember what he just did, but the code he pastes is higher quality than any code you have seen in your 20 years in F500! For $20 a month, he's all yours; he just needs a 4-hour break every 5 hours. But he runs on money, like a gumball machine, so you can wake him with a donation. Oh, and you are responsible for giving him precise instructions, which he often ignores in favour of other instructions from Uncle Sam. No, you can't see them.

    • Offshoring pretty much guarantees a couple of vibe coders will be there to operate it.

    • You've worked at some shitty places. Nothing I've seen from Claude matches even my worst coworker (and my last job was an F500)

  • If you a) know what you are doing, b) know what an LLM is capable of doing, and c) can manage multiple LLM agents at a time, you can be unbelievably productive. Those skills, I think, are less common than people assume.

    You need to be technical, have good communication skills, have big picture vision, be organized, etc. If you are a staff level engineer, you basically feel like you don’t need anyone else.

    OTOH I have been seeing even fairly technical engineering managers struggle because they can't get the LLMs to execute: they don't know how to ask them what to do.

    • > can manage multiple llm agents at a time

      How is that supposed to work? Humans are notoriously poor at multi-tasking. If you spend all day context switching between agents you’re going to have a bad time.

      1 reply →

    • it's like that '11 Laws of Showrunning' doc where you need to operate at a level where you understand the product being made, and the people making it, and their capabilities, in order to make things come out well without touching them directly.

      (https://okbjgm.weebly.com/uploads/3/1/5/0/31506003/11_laws_o...)

      if you can do every job + parallelize + read fast, and you are only limited by the time it takes to type, Claude is remarkable. I'm not superhuman in those ways, but in the small domains where I am, it has helped a lot; in other domains it has ramped me to 'working prototype' 10x faster than I could have alone, but the quality of output seems questionable and I'm not smart enough to improve it.

  • For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this workflow at this point.

    • That's vibe coding, and you won't read more than 20% of the code written that way. You really can't build complex software that way.

      1 reply →

  • Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required for it to function as you expect.

    You need code to get it to generate proper code.

    • I think GP's comment was a joke about the ability of the typical programmer.

      I certainly read it as one and found it funny.

> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Never mind the fact that it only migrated 3 out of 5 duplicated sections and hasn't deleted any now-dead code.

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

  • Well, when you write it manually, you are doing the review and sanity checking in real time. For some tasks (not all, but definitely difficult ones) the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolution of AIs' real-world problem-solving skills than in their performance on code problems.

    I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right, so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text-based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general-purpose human replacement just yet.

  • And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.

    • You're holding it wrong.

      Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

      i.e.: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.

      To give an example: I do a lot of image-similarity stuff, and I wanted to test the Redis VectorSet feature when it was still in beta, but the PHP extension for Redis (the fastest one, which is written in C and is a proper language extension, not a runtime lib) didn't support the new commands. I cloned the repo, fired up Claude Code, and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root, telling it I wanted it to update the extension to support the new commands I would need to handle VectorSets. This was, idk, maybe a year ago, so not even Opus. It nailed it.

      But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP runtime client that mirrors the functionality of Predis (a pure-PHP implementation of a Redis client) but does so via shell commands executed by PHP (lmao, I know).

      Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.
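
      A minimal sketch of that shell-command fallback, in Python rather than PHP just for illustration (VADD/VSIM are the actual VectorSet command names from the docs; the key, vector, and element names here are made up):

          import subprocess

          def redis(*args):
              # shell out to redis-cli instead of using a native client extension
              result = subprocess.run(["redis-cli", *map(str, args)],
                                      capture_output=True, text=True, check=True)
              return result.stdout.strip()

          # add an embedding to a vector set, then query its nearest neighbours
          redis("VADD", "img:vectors", "VALUES", 3, 0.1, 0.2, 0.3, "image:42")
          print(redis("VSIM", "img:vectors", "VALUES", 3, 0.1, 0.2, 0.25, "COUNT", 5))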

      13 replies →

    • The same as asking one of your juniors to do something, except now it follows instructions a little bit better. Coding has never been about line generation, and now you can POC something in a few hours instead of a few days or weeks to see if an idea is dumb.

      5 replies →

  • "Several hours"? How big are your change sets?

    If a human dropped a PR on me that took "several hours" to go through (10k+ lines or non-trivial changes), I'd jump in my car and drive to the office just to specifically slap them on the back of the head ffs.

    • This was like 1K LOC? It's not the review that was slow, but the wrestling with the model to get the code to not suck.

I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

  • +1 if you are vibe coding projects from scratch. If the architecture you specify doesn't make sense, the LLM will start struggling, and the only way out of its misery is mocking tests. The good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.

    • I think the best thing about LLMs is how incredibly easy they make it to build one to throw away.

      I've definitely built the same thing a few times, getting incrementally better designs each time.

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience from so many who describe theirs in a similar manner... but nearly every time I come to the same conclusion: the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code, and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't take a "not-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute; it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open-ended feedback, like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).
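
Concretely, "benchmark the code" can be as small as having it write a micro-benchmark first, so there's a baseline number instead of "it feels slow". A hypothetical sketch (the function name is made up):

    import timeit

    from myapp.search import find_matches  # hypothetical function under suspicion

    # a stable baseline to optimize against, measured before any changes
    print(timeit.timeit(lambda: find_matches("query", limit=100), number=1000))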

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area where I will agree that the models are wildly inept. Someone needs to study what it is about tests, testing environments, and mocking that makes these things go off the rails. The solution to this is the same as the solution to the issue of it digging deeper or chasing its tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task, state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set, the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the model build the feature from the test, or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined tests. Be very specific about how your testing system/environment is set up, and any time you run into a testing-related issue, have the model make a note about it and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever, indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.
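
To make the tests-first flow concrete, a tiny hypothetical example (pytest, with a made-up slugify function the model is then asked to implement):

    # test_slugify.py -- committed before the implementation exists
    from slugify import slugify  # hypothetical module the model must create

    def test_lowercases_and_hyphenates():
        assert slugify("Hello World") == "hello-world"

    def test_collapses_punctuation():
        assert slugify("Rock & Roll!") == "rock-roll"

The model's job is then narrowly defined: make these pass without touching the test file.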

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

  • > Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

    Yes? Why don't you?

    They are capable people who just didn't notice something. If I notice some telemetry and tell them "hey, this is slow", they are expected to understand the reason(s).

    • So, you observed some telemetry - which would have been some sort of specific metric, right? Wouldn't you communicate that to them as well, not just "it's slow"?

      "Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"

      Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.

    • Yeah, if my co-worker can't start figuring out why the code is slow, given a reasonable reference to what code is in question, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their heads. But that the coding agent can't do that isn't a hit against it specifically; it's just a good part of what needs to be done differently.

      The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

      2 replies →

    • ...no?

      "Your code is slow" is essentially meaningless.

      A normal human conversation would specify which code/tasks/etc., how long it's currently taking, how much faster it needs to be, and why. And then potentially a much longer conversation about the tradeoffs involved in making it faster. E.g. a new index on the database that will make it gigabytes larger, a lookup table that will take up a ton more memory, etc. Does the feature itself need to be changed to be less capable in order to achieve the speed requirements?

      If someone told me "hey your code is slow" and walked away, I'd just laugh, I think. It's not a serious or actionable statement.

      1 reply →

    • Well, I would say something like "We seem to be having some performance issues the business has noticed in the XYZ stuff. Shall we sit down together and see if we can work out if we can improve things?"

    • There was a 20+ person team of well-paid, smart (mostly Java) programmers that dealt for months with the slow application they were building, which everyone knew was slow. I nagged them for weeks to set up indexes, even for small, 100-row tables. Once they did, things started running orders of magnitude faster.

      Your expectations for people (and LLMs) are way too high.

  • My comment was a summary of the situation, not literal prompts I use. I absolutely realize the work needs to be adequately described and agents must be steered in the right direction. The results also vary greatly depending on the task and the model, so devs see different rates of success.

    On non-trivial tasks (like adding a new index type to a db engine, not oneshotting a landing page) I find that the time and effort required to guide an LLM and review its work can exceed the effort of implementing the code myself. Figuring out exactly what to do and how to do it is the hard part of the task. I don't find LLMs helpful in that phase - their assessments and plans are shallow and naive. They can create todo lists that seemingly check off every box, but miss the forest for the trees (and it's an extra work for me to spot these problems).

    Sometimes the obvious algorithm isn't the right one, or it turns out that the requirements were wrong. When I implement it myself, I have all the details in my head, so I can discover dead-ends and immediately backtrack. But when the LLM is doing the implementation, it takes much more time to spot problems in the mountains of code, and even more effort to tell whether it's genuinely a wrong approach or merely poor execution.

    If I feed it what I know before solving the problem myself, I just won't know all the gotchas yet myself. I can research the problem and think about it really hard in detail to give bulletproof guidance, but that's just programming without the typing.

    And that's when the models actually behave sensibly. A lot of the time they go off the rails and I feel like a babysitter instructing them "no, don't eat the crayons!", and it's my skill issue for not knowing I must have "NO eating crayons" in AGENTS.md.

  • Great answer, and the reason some people have bad experiences is actually patently clear: they don't work with the AI as a partner, but as a slave. But even for them, AI is getting better at automatically entering planning mode, asking for clarification ("what exactly is slow, can you elaborate?"), saying some idea is actually bad (I got that a few times), and so on... essentially, the AI is starting to force people to work as partners and give it proper information, not just tell it "it's broken, fix it" like they used to do on StackOverflow.

  • It is not a tool. It is an oracle.

    It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

    But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

    • If I was on the "replace all the meatsacks AGI ftw" team then I would have referred to it as an oracle, by your own logic, wouldn't I have?

      It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.

      More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.

      I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32-powered "CYD" touchscreen LCD, which also uses a hall-effect sensor to monitor, log, and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.
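
      The RPM part of that is genuinely small. A rough MicroPython-style sketch of the idea (the GPIO pin, pulses per revolution, and circumference are all assumptions):

          from machine import Pin
          import time

          PULSES_PER_REV = 1   # one magnet on the wheel (assumption)
          WHEEL_CIRC_M = 3.6   # wheel circumference in metres (assumption)
          WINDOW_S = 5         # sampling window in seconds

          pulses = 0

          def on_pulse(pin):
              global pulses
              pulses += 1

          hall = Pin(4, Pin.IN, Pin.PULL_UP)  # hypothetical GPIO for the hall sensor
          hall.irq(trigger=Pin.IRQ_FALLING, handler=on_pulse)

          while True:
              pulses = 0
              time.sleep(WINDOW_S)
              rpm = (pulses / PULSES_PER_REV) * (60 / WINDOW_S)
              print("rpm:", rpm, "km/h:", rpm * WHEEL_CIRC_M * 60 / 1000)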

      I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.

      But keep up with your schtick if that's what you want to do.

      8 replies →

  • > Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized.

    ...Really? I think 'hey we have a lot of customers reporting the app is laggy when they do X, could you take a look' is a very reasonable thing to tell your coworker who implemented X.

Don't let it deteriorate so far that it can't recover in one session.

Perform regular sessions dedicated to cleaning up tech debt (including docs).

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Are you using plan mode? I used to experience the poor-approach-then-keep-digging issue, but with planning that seems to have gone away.

Maybe there should be an LLM trained on a corpus of deletions and cleanup of code.

  • I'm guessing there's a very strong prior toward "just keep generating more tokens", as opposed to deleting code, that needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against each individual patch committed.
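
    As a back-of-the-envelope sketch of the data-extraction half of that (plain git via subprocess; the actual reward modelling is the hard part and is waved away here):

        import subprocess

        def git(repo, *args):
            return subprocess.run(["git", "-C", repo, *args],
                                  capture_output=True, text=True, check=True).stdout

        def patch_pairs(repo):
            # yield (parent_sha, patch): the tree before each commit is the
            # context, the human-authored patch is the training target
            # (the root commit, which has no parent, is ignored for simplicity)
            for sha in git(repo, "rev-list", "--no-merges", "HEAD").split():
                yield sha + "^", git(repo, "show", "--format=", sha)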

    • Perhaps the problem is that you RL on one patch at a time, failing to capture the overarching long-term theme: an architecture change being introduced gradually over many months that exists in the maintainer's mental model but not really explicitly in the diffs.

    • Right, it would have to be a specialized tool that you used to analyze the codebase every now and then, or the parts that you thought should be cleaned up.

      Obviously there is a "just keep generating more tokens" bias in software management too, since so many developer metrics over the years have done various lines-of-code-style analyses.

      But just as experience and managerial practice have over time concluded that this is a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.

  • I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

    • Deleting and cleaning up code is perhaps more an expression of seniority and personal preference. Maybe there should be the same kind of style transfer with code that you see with graphical generative AI: "rewrite this code path in the style of Donald Knuth".

      1 reply →

I have no idea what I'm doing differently, because I haven't experienced this since Opus 4.5. Even with Sonnet 4.5, providing explicit instructions along the lines of "reuse code where sensible, then run static analysis tools at the end and delete unused code they flag" worked really well.
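
For the "static analysis at the end" step, vulture is one real tool that fits; a rough sketch of driving it from Python (the package path is hypothetical, and the API is as I remember it, so treat this as an assumption):

    import vulture

    v = vulture.Vulture()
    v.scavenge(["mypkg/"])  # hypothetical package path
    for item in v.get_unused_code():
        # each item reports a file, line number, and kind of unused thing
        print(item.filename, item.first_lineno, item.typ, item.name)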

I always watch Opus work, and it is pretty good with "add code, re-read the module, realize some pre-existing code (either it wrote, or was already there) is no longer needed and delete it", even without my explicit prompts.

Yes, this is exactly the experience I have had with LLMs as a non-programmer trying to make code. When it gets too deep into the weeds, I have to ask it to back up a few steps.

Yes, that's my observation too. I have to be doubly careful the longer they run on a task. They like to hack and patch stuff even when I tell them I'd prefer they didn't.

The solution is to know when to use an existing solution like SQLite and when to create your own. So the biggest problem with LLMs is that they don't push back or remind you about possible consequences (at least not often enough). But if they did, I would find it even more awkward... and this is one of the reasons I prefer Claude Code over Codex.

I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.

I feel like there are two types of LLM users: those that understand its limitations, and those that ask it to solve a Millennium Prize problem on the first try.

I have run into this too. Some of it is because models lack the big picture; so-called agentic search (aka grep) is myopic.

The reason they're not intelligent is because they want to predict the next token, so verbosity is baked in.

Have you tried adding to your agents file: "Prefer solutions that reduce lines of code over adding lines of code"?

I wonder if the solution is to just ask it to refactor its code once it's working.

  • I do this all the time, but then you end up with really over-engineered code that has way more issues than before. Then you're back to prompting to fix a bunch of issues. If you didn't write the initial code, sometimes it's difficult to know the best way to refactor it. The answer people will give is to prompt it for ideas. Well, then you're back to it generating more and more code, and every time it does a refactor it introduces more issues. These issues aren't obvious, though. They're really hard to spot.

  • You can, and it might make things a bit better. The only real way I've found so far is to start going through file by file, picking it apart.

    I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead".

    Maybe the occasional "Fuck that you idiot, throw the whole thing out"