Comment by pshirshov
6 days ago
Essentially, I tried to throw Claude a task which, I thought, it wouldn't handle. It did, with minimal supervision. Some things had to be done in "adversarial" mode where Claude coded and Codex criticized/reviewed, but it is what it is. An LLM was able to implement generics and many other language features with very little supervision in less than a day o_O.
I've been thrilled to see it using GDB with inhuman speed and efficiency.
I am very impressed with the kind of things people pull out of Claude's жопа but can't see such opportunities in my own work. Is success mostly the result of it being able to test its output reliably, and of how easy it is to set up the environment for this testing?
> Is success mostly the result of it being able to test its output reliably, and of how easy it is to set up the environment for this testing?
I wouldn't say so. From my experience the key to success is the ability to split big tasks into smaller ones and to help the model with solutions when it's stuck.
Reproducible environments (Nix) help a lot, yes, same for sound testing strategies. But the ability to plan is the key.
One other thing I've observed is that Claude fares much better in a well-engineered pre-existing codebase. It adapts to most of the style and has plenty of "positive" examples to follow. It also benefits from the existing test infrastructure. It will still tend to go in infinite loops, or introduce bugs and then oscillate between them, but I've found it to be scarily efficient at implementing medium-sized features in complicated codebases.
Claude will also tend to go for the "test-passing" development style, where it gets super fixated on making the tests pass with no regard to how the feature will work with whatever is intended to be built later.
I had to throw away a couple days' worth of work because the code it built to pass the tests couldn't do the actual thing it was designed for, and the only fix was to go back and build it correctly while, ironically, keeping the same tests.
You kind of have to keep it on a short leash but it'll get there in the end... hopefully.
жопа -> zhopa ("ass" in Russian), for those who don't spot the joke
> Some things had to be done in "adversarial" mode where Claude coded and Codex criticized/reviewed
How does one set up this kind of adversarial mode? What tools would you need to use? I generally use Cline or KiloCode - is this possible with those?
You can use the orchestrator mode and tell it that it must run a review subtask after every sub-task completes (works in RooCode; I'm guessing KiloCode should also have the feature).
Or you can just switch the models in a regular conversation and tell one to review everything up until now, optionally telling it to get a git diff of all the unstaged changes.
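The "git diff of all the unstaged changes" step is simple to script. A minimal sketch (the prompt wording and function names here are my own, not from any particular tool):

```python
import subprocess


def unstaged_diff() -> str:
    """Collect the diff of all unstaged changes in the current repo."""
    return subprocess.run(
        ["git", "diff"], capture_output=True, text=True, check=True
    ).stdout


def build_review_prompt(diff: str) -> str:
    """Wrap a diff in a review instruction to hand to the second model."""
    return (
        "Review everything up until now. "
        "Here is the diff of all unstaged changes:\n\n"
        + diff
        + "\n\nList bugs, design problems, and tests that should exist but don't."
    )
```

You'd then paste (or pipe) `build_review_prompt(unstaged_diff())` into the conversation after switching models.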
I use my own (very dirty) tool; there are some public ones, and I'll probably try to migrate to one of the more mature tools later. Example: https://github.com/ruvnet/claude-flow
> is this possible with those?
You can always write to stdin/read from stdout even if there is no SDK available, I guess. Or create your own agent on top of an LLM provider.
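The stdin/stdout approach can be as small as one subprocess call. A sketch (the agent command itself is whatever CLI you have installed; nothing here assumes a specific tool):

```python
import subprocess


def ask_agent(argv: list[str], prompt: str, timeout: int = 600) -> str:
    """Send a prompt to a CLI agent over stdin and return its stdout.

    argv is the agent's command line, e.g. the coder's CLI or the
    reviewer's CLI; alternating between the two gives the adversarial loop.
    """
    result = subprocess.run(
        argv, input=prompt, capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(f"agent failed: {result.stderr.strip()}")
    return result.stdout
```

Swapping `argv` between the coding agent and the reviewing agent, and feeding each one's output back as the other's prompt, is the whole "adversarial mode" in miniature.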
How did you get gdb working with Claude? There are a few MCP servers that look fine; curious what you used.
Well, I just told it to use gdb when necessary; MCP wasn't required at all! It also helps to tell it to integrate cpptrace and always look at the stack traces.
MCP is more or less obsolete for code generation since agents can just run CLI tools directly.