Comment by kaydub (1 day ago)

Yeah, it sounds like "you're holding it wrong"

Like, why are you manually tidying and fixing things? The first pass is never perfect. Maybe the functionality is there but the code is spaghetti or untestable. Have another agent review and feed that review back into the original agent that built out the code. Keep iterating like that.

My usual workflow:

Agent 1 - Build feature
Agent 2 - Review these parts of the code, see if you find any code smells, bad architecture, scalability problems that will pop up, untestable code, or anything else falling outside of modern coding best practices
Agent 1 - Here's the code review for your changes, please fix
Agent 2 - Do another review
Agent 1 - Here's the code review for your changes, please fix

Repeat until testable, maybe throw in a full codebase review instead of just the feature.

Agent 1 - Code looks good, start writing unit tests, go step by step, let's walk through everything, etc. etc. etc.

Then update your .md directive files to tell the agents how to test.

Voila, you have an LLM agent loop that will write decent code and get features out the door.
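
Roughly, as a sketch in Python - the agent CLI names and the "-p" flag below are placeholders, so swap in whatever non-interactive invocation your tools actually support:

```python
import subprocess

def run_agent(cli: str, prompt: str) -> str:
    """Run one agent CLI non-interactively and return its output.

    The invocation is a placeholder; substitute the non-interactive
    flags your tool actually exposes.
    """
    result = subprocess.run(
        [cli, "-p", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

FEATURE_PROMPT = "Build the feature described in the spec."
REVIEW_PROMPT = (
    "Review the latest changes for code smells, bad architecture, "
    "scalability problems that will pop up, untestable code, or anything "
    "else falling outside of modern coding best practices."
)

# Agent 1 builds, Agent 2 reviews, the review is fed back to Agent 1. Repeat.
run_agent("agent1", FEATURE_PROMPT)
for _ in range(3):  # or loop until the review comes back clean
    review = run_agent("agent2", REVIEW_PROMPT)
    run_agent("agent1", "Here's the code review for your changes, please fix:\n" + review)
```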

I'm not trying to be rude here at all but are you manually verifying any of that? When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive "2123/2123 tests passed!" but in reality it's testing mostly nothing of value. And that's when they aren't bypassing commit checks or just commenting out tests or saying "I fixed it all" while multiple tests are broken.

Maybe I need a stricter harness but I feel like I did try that and still didn't get good results.

  • Those kinds of errors were super common 4-6 months ago, but LLM quality moves fast. Nowadays I don't see them very often at all. Two things make a huge difference. First, work on writing a spec: github.speckit, GSD, BMAD, or whatever tool you like can help with this. Do several passes on the spec to refine it and focus on the key ideas.

    Second, now that you have a spec, task it out, but tell the LLM to write the tests first (like Test-Driven Development, but without all the formalisms). This forces the LLM to focus on the desired behavior instead of the algorithms. Be sure the tests cover real behavior: client APIs doing the right error handling when they get bad input, handling tricky cases, and so on. Tell the system not to write 'struct' tests - checking that getters/setters work isn't interesting or useful.

    Then you implement 1-3 tasks at a time, getting the tests to pass. Your rules prevent disabling tests, commenting out tests, and, most importantly, changing the behavior of the tests. It doesn't use a lot of context, there's little to no hallucinating, and progress is easily measurable.
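
    To make the 'struct test' distinction concrete, here's a quick pytest sketch; parse_config is a made-up function standing in for whatever client API is under test:

    ```python
    import pytest

    from myapp.config import parse_config  # hypothetical module under test

    def test_rejects_malformed_port():
        # Behavior-focused: bad input should surface a clear error,
        # not a silent default or a stack trace from deep inside the parser.
        with pytest.raises(ValueError, match="port"):
            parse_config({"host": "localhost", "port": "not-a-number"})

    def test_host_field_roundtrip():
        # 'Struct' test: only proves that assignment works. This is the kind
        # of test to tell the model not to bother writing.
        cfg = parse_config({"host": "localhost", "port": 8080})
        assert cfg.host == "localhost"
    ```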

  • I feel like it was doing what you're describing 4-6 months ago, especially commenting out tests. Not always, but I'd have to do more things step by step and keep the LLM on track. For the last 3-4 months, though, it's been writing decent unit tests without much hand-holding or refactoring.

    • Hmm, my last experience was within the last 2 months, but I'm trying not to write it off as "this sucked and will always suck". That's the #1 reason I keep testing and playing with these things: the capabilities are increasing quickly, and what did or didn't work last week (especially with the last model) might work this week.

      I'll keep testing it, but that just hasn't been my experience. I sincerely hope that changes, because an agent that runs unit tests [0] and can write them would be very powerful.

      [0] This is a pain point for me. The number of times I've watched Claude run "git commit --no-verify"... I've told it in CLAUDE.md to never bypass commit checks, I've told it in the prompt, I've added it 10 more times in different places in CLAUDE.md, but still, the agent will always reach for that if it can't fix something in 1-3 iterations. And yes, I've told it "If you can't get the checks to pass then ask me before bypassing the checks".

      It doesn't matter how many guardrails I put up and how good they are if the agent will lazily bypass them at the drop of a hat. I'm not sure how other people are dealing with this (maybe with agents managing agents and checking their work? A la Gas Town?).
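
      One heavier-handed option is enforcing it outside the prompt entirely, as a hard filter on the shell commands the agent proposes. A minimal sketch, assuming your harness can pipe each proposed command into a script and refuse the call on a non-zero exit:

      ```python
      import sys

      BANNED_FRAGMENTS = ("--no-verify",)  # extend with anything else you never want run

      def allowed(command: str) -> bool:
          """Return False for commands that try to bypass commit checks."""
          return not any(fragment in command for fragment in BANNED_FRAGMENTS)

      if __name__ == "__main__":
          proposed = sys.stdin.read()
          if not allowed(proposed):
              print("Blocked: commit checks must not be bypassed.", file=sys.stderr)
              sys.exit(2)  # assumption: the harness treats a non-zero exit as "refuse this call"
      ```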

    • Literally yesterday I was using Claude to write a SymPy symbolic verification of a mathematical assertion it was making about some rigorous algebra/calculus I was having it do for me. That's the best possible hygiene I could adopt for checking its output, and it still failed to report the results correctly.

      After manual line-by-line inspection and hand-tweaks, it still saved me time. But it's going to be a long, long time before I no longer manually tweak things or trust that there are no silent mistakes.
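
      For reference, this is the shape of the SymPy check I mean, with a stand-in identity since the original assertion isn't shown; the point is that you read the zero yourself rather than trust the model's summary of it:

      ```python
      import sympy as sp

      x = sp.symbols("x")

      # Stand-in claim: d/dx [x*sin(x)] == sin(x) + x*cos(x)
      claimed = sp.sin(x) + x * sp.cos(x)
      derived = sp.diff(x * sp.sin(x), x)

      # If the claim holds, the difference simplifies to exactly zero.
      # Inspect this result yourself rather than relying on the model's report of it.
      assert sp.simplify(derived - claimed) == 0
      ```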

  • >> When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive "2123/2123 tests passed!" but in reality it's testing mostly nothing of value.

    This has not happened to me since Sonnet 4.5. Opus 4.5 is especially robust when it comes to writing tests. I use it daily in multiple projects and verify the test code.

    • I thought I did use Opus 4.5 when I tested this last time, but I might have still been on the $20 plan, and I can't remember whether you get any Opus 4.5 on that in Claude Code (I thought you did, with really low limits?). So maybe I wasn't using Opus 4.5; I'll need to try again.

I haven’t used a multi-agent setup yet, but it’s intriguing.

Are you using Claude Code? How do you run the agents and get them talking to each other?

  • Let me clarify, actually: I run separate terminals and the agents are separated. I think the Claude Code CLI is the best, but at home I pay per token, and I have a Google account and pay for ChatGPT, so I often use the Codex and Gemini CLIs in tandem. I'll copy and paste stuff between them sometimes, or I'll have one review the changes (or just the code in general) and then feed the other with the output. I'll break out Claude Code for specific tasks or when I feel like Gemini/ChatGPT aren't quite doing the job right (which has gotten rarer over the past few months).

    I messed around with separate "agents" in the same context window for a while. I even went as far as playing with Strands Agents. Having multiple agents was a crapshoot.

    Sometimes they'd work great, but sometimes they'd start working on the same files at the same time, argue with each other, etc. The way I'd always get multiple agents working, at least the way I assumed they should work, was by telling the LLM explicitly what agents to create and what work to pass off to which agents, and it did a pretty poor job of that. I tried having orchestration agents, but at a certain point the orchestration agent would just take over and do everything.

    So I'm not big on having multiple agents (in theory it sounds great, especially since they're supposed to each have their own context window). When I attempted this kind of thing with Strands Agents, it honestly felt like I was trying to recreate Claude, so I just stick with plain CLI LLM tools for now.