Comment by gregates
11 hours ago
The version of this I encounter literally every day is:
I ask my coding agent to do some tedious, extremely well-specified refactor, such as (to give a concrete real life example) changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware. I am very clear — we are not actually changing any behavior, just the fn signature. In fact, at all call sites, I want it to specify a default locale, because we haven't actually localized anything yet!
Said agent, I know, will spend many minutes (and tokens) finding all the call sites, and then I will still have to either confirm each update, or yolo it and trust the compiler, the tests, and the agent's ability to deal with their failures. I am OK with this: while I could do this just fine with vim and my LSP, the LLM agent can do it in about the same amount of time, maybe even a little less, and it's a very straightforward change that's tedious for me, and I'd rather think about or do anything else and just check in occasionally to approve a change.
But my f'ing agent is all like, "I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?"
And in that moment I guess I know why some people say having an LLM is like having a junior engineer who never learns anything.
Claude 4.7 broke something while we were working on several failing tests and justified itself like this:
> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.
I know that an LLM cannot inspect its own internal state or explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed".
> strictly speaking, it was working before and now it isn't
I've been seeing more things like this lately. It's doing the weird kind of passive deflection that's very funny in the abstract and very frustrating when it happens to you.
What gets me is when the tests are correct and match the spec/documentation for the behavior, but the LLM starts changing the tests and the documentation instead of fixing the broken behavior... then I have to revert (git reset) and tell the agent that the test is correct, and that I want the behavior to match the test and documentation, not the other way around.
I'm usually pretty particular about how I want my libraries structured and used in practice... Even for the projects I do myself, I'll often write the documentation for how to use it first, then fill in code to match the specified behavior.
I've been doing a lot of experimentation with "hands-off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The rule is simple: no tests that were previously passing are allowed to fail in subsequent turns. I enforce this by spawning an agent in a worktree, having it do some work, and then, when it's done, running the suite and merging the code into trunk.
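Sketched concretely, the gating loop looks something like the following. All the names here are illustrative, and the agent step is stubbed out with a plain shell command; a real setup would invoke the coding agent in the worktree and run the hidden suite where the placeholder check is:

```shell
#!/usr/bin/env sh
# Sketch of a worktree-gated merge loop. The "agent" and the "suite"
# are placeholders; substitute your own.
set -eu

REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email agent@example.com
git config user.name agent
git commit -q --allow-empty -m "init"

# 1. Give the agent an isolated worktree on its own branch.
WT="$REPO-wt"
git worktree add "$WT" -b agent-task
(
  cd "$WT"
  echo "agent output" > result.txt   # placeholder for real agent work
  git add result.txt
  git commit -q -m "agent work"
)

# 2. Gate: run the hidden suite against the worktree.
#    (Trivial stand-in check here.)
if (cd "$WT" && test -f result.txt); then
  # 3. Only if nothing regressed does the work land on trunk.
  git merge -q agent-task
  echo "merged"
else
  echo "rejected"
fi
```

Because the suite lives outside the worktree, the agent never sees it and can't "fix" a failing test by editing it.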
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common: "this test was already failing before my changes," or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements were not necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the worktree and commit its changes directly to trunk. Its CoT usually indicates that the agent "is aware" it's doing a bad thing, and is usually accompanied by something like, "I know that this will break the build, but I've been working on this task for too long. I'll just check in what I have now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.
At the risk of being That Old Guy, this seems like a pretty bad workflow regression from what ctags could do 30 years ago
> changing a commonly used fn to take a locale parameter
I have to ask, is this the sort of thing people use agents/AI for?
Because I'd probably reach for sed or awk.
I think about half the IDEs I've ever used just had this as a feature. Right-click on function, click on "change signature", wait a few seconds, verify with `git diff`.
I actually still like LLMs for this. I use the Rust LSP (rust-analyzer) and it supports this, but LLMs will additionally go through and reword all of the documentation, doc links, comments, var names in other fns, and so on, in one go.
Are they perfect? Far from it. But it's more comprehensive. Additionally, simple refactors like this are insanely fast to review, so it's really easy to spot a bad change. Plus, I'm in Rust, so it's very heavily typed.
In a lot of scenarios I'd prefer an AST grep over an LSP rename, but that also doesn't cover the docs/comments/etc.
Yeah, and this has the advantage of being both deterministic and only updating things that are actually linked, as opposed to also accidentally updating naming collisions.
It's not always amenable to grepping. But this is a great use case for AST searches, and is part of the reason that LSP tools should really be better integrated with agents.
Works fine in Algol-like languages (C and C++ for a start): just change the function prototype and find all the instances from the compiler errors, using your compiler as the AST explorer...
Agents do use LSPs.
Programming languages are formal, so unless you're doing magic stuff (eval and reflection), you can probably grep file by file, eliminate the false positives, then do a bit of awk or shell scripting with sed. Or use Vim or Emacs tooling.
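For the locale refactor discussed upthread, that sed pass might look like the sketch below. The function name `format_price` and the `Locale::default()` value are invented for illustration, and a toy file stands in for a real source tree; multi-line call sites and name collisions would still need the manual pass over false positives:

```shell
# Toy file standing in for a real source tree.
cat > demo.rs <<'EOF'
let a = format_price(total);
let b = format_price(subtotal);
EOF

# Prepend a default-locale argument at every call site.
# (-i.bak works on both GNU and BSD sed.)
sed -i.bak 's/format_price(/format_price(Locale::default(), /g' demo.rs
cat demo.rs
```

The `.bak` backup doubles as a cheap dry-run record: diff it against the rewritten file before deleting it.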
Ah yes, don't fix the agents, fix the tools.
What a ridiculously backwards approach.
We were supposed to get agents who could use human tooling. Instead we are apparently told to write interfaces for this stumbling expensive mess to use.
Maybe, just maybe, if the human knows to reach for the AST tool, and can use it just fine, the problem is not the tool but the agent.
In general, yes, I might use an LLM for a tedious refactor. In this case I might try <https://github.com/ast-grep/ast-grep> though.
Or the "find all references" feature almost every code editor has...
> I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?
I think some of this is a problem in the agent's design. I've got a custom harness around GPT5.4 and I don't let my agent do any tool calling in the user-facing conversation. The root conversation acts as a gatekeeper and fairly reliably pushes crap responses like this back down into the stack with "OK great! Start working on items 1-20," etc.
Ehhhhh, "problem" is a strong word. Sometimes you're throwing out a lot of signal if you don't let the coding agent tell you it thinks your task is a bad idea. I got a PR once attempting to copy half of our production interface because the author successfully convinced Claude his ill-formed requirements had to be achieved no matter what.
There is no use for an automated system that "argues" with your commands. If I ask it to advise me, that's one thing, but if I command it to perform, nothing short of obedience will suffice.
> Maybe we should just commit the signature change with a TODO
I'm fascinated that so many folks report this, I've literally never seen it in daily CC use. I can only guess that my habitually starting a new session and getting it to plan-document before action ("make a file listing all call sites"; "look at refactoring.md and implement") makes it clear when it's time for exploration vs when it's time for action (i.e. when exploring and not acting would be failing).
You can do that in IntelliJ in about 15 seconds and no tokens...
Indeed you can! I don't use IntelliJ at work for [reasons], and LSP doesn't support a change-signature action with defaults for new params (afaik). But it really seems like something any decent coding agent ought to be able to one-shot for precisely this reason, right?
Using an LLM for these tasks really is somewhat like using a Semi to shuttle your groceries home. Absolutely unnecessary, and it can be done with a scooter. But if a Semi is all you have, you use it for everything. So here we are.
The real deal is, while a Semi can do all the things you can do with a scooter, the opposite is not true.
> But if a Semi is all you have
Seems like a pretty lousy work situation when you have LLMs but no decent IDE.
> the opposite is not true.
You can't ("shouldn't") take a semi on a sidewalk or down a narrow alley.
> while a Semi can do all the things you can do with a scooter
You may be able to lane split in a semi, but it also has excessive environmental impact.
The LLM only has to parse the request and farm out execution to the LSP. It saves you from having to find the function definition.
"Use an agent to…" is much more effective in my experience, because agents have no means of communicating with you. They are more likely to just do it.
Make it write a script with dry run and a file name list.
You’ll be amazed how good the script is.
My agent did 20 class renames and 12 tables, over 250 files, and from prompt to auditing the script to dry run to apply, the total wall-clock time was 7 minutes.
Took a day to review but it was all perfect!
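A minimal sketch of that kind of script, in Python, with an invented `OldClass` → `NewClass` rename standing in for a real one. By default it only reports what would change; nothing is written until you opt in:

```python
import argparse
import pathlib
import re

def rename(files, apply_changes=False):
    """Preview a mechanical rename; only write files when apply_changes=True."""
    for name in files:
        path = pathlib.Path(name)
        old = path.read_text()
        new = re.sub(r"\bOldClass\b", "NewClass", old)
        count = len(re.findall(r"\bOldClass\b", old))
        if count:
            # Dry run by default: report what would change, file by file.
            print(f"{name}: {count} occurrence(s) of OldClass")
            if apply_changes:
                path.write_text(new)

def main(argv=None):
    p = argparse.ArgumentParser(description="rename OldClass to NewClass")
    p.add_argument("files", nargs="+", help="explicit file name list")
    p.add_argument("--apply", action="store_true",
                   help="write the changes; the default is a dry run")
    args = p.parse_args(argv)
    rename(args.files, apply_changes=args.apply)
```

The explicit file list plus the dry-run default is what makes the script auditable before anything touches the tree.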
Refactoring already exists.
Asking for code to manipulate the AST is another route. In python it can do absolute magic.
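A sketch with Python's stdlib `ast` module, taking the locale example from the top of the thread; `format_price` and `DEFAULT_LOCALE` are invented names. Note that `ast.unparse` discards comments and formatting, so a real refactor would reach for a concrete-syntax-tree library such as LibCST instead:

```python
import ast

SOURCE = """\
a = format_price(total)
b = format_price(subtotal)
"""

class AddDefaultLocale(ast.NodeTransformer):
    """Prepend a DEFAULT_LOCALE argument to every format_price(...) call."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "format_price":
            node.args.insert(0, ast.Name(id="DEFAULT_LOCALE", ctx=ast.Load()))
        return node

tree = AddDefaultLocale().visit(ast.parse(SOURCE))
rewritten = ast.unparse(tree)
print(rewritten)
```

Unlike a regex, this only touches genuine call expressions, so a string literal or comment containing `format_price(` is left alone.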
That's my daily experience too. There are a few more behaviours that really annoy me, like:

- it breaks my code, tests start to fail, and it instantly says "these are all pre-existing failures" and moves on like nothing happened
- or it wants to run some command, I click the "nope" button, and it just outputs "the user didn't approve my command, I need to try again," and I need to click "nope" 10 more times or yell at it to stop
- and the absolute best is when, instead of just editing 20 lines one after another, it decides to use a script to save 3 nanoseconds, and it always results in some hot mess of botched edits that it then wants to revert by running git reset --hard and starting from zero. I've learned that it usually saves me time if I never let it run scripts.
> it breaks my code, tests start to fail and it instantly says “these are all pre existing failures” and moves on like nothing happened
Reminds us of the most important button the "AI" has, over the similarly bad human employee.
'X'
Until, of course, we pass responsibility for that button to an "AI".
The other day Codex on Mac gained the ability to control the UI. Will it close itself if instructed though? Maybe test that and make a benchmark. Closebench.
I've never hit that one; do you have a lot of `ToDo`s in your code comments?
If it’s a compiled language, just change the definition and try to compile.
Indeed! You would think it would have some kind of sense that a commit that obviously won't compile is bad!
You would think.
It would be one thing if it was like, ok, we'll temporarily commit the signature change, do some related thing, then come back and fix all the call sites, and squash before merging. But that is not the proposal. The plan it proposes is literally to make what it has identified as the minimal change, which obviously breaks the build, and call it a day, presuming that either I or a future session will do the obvious next step it is trying to beg off.
Pretty sure it’s a harness or system prompt issue.
I have never seen those "minimal change" issues when using Zed, but I have seen them in Claude Code and Aider. I've been using sonnet/opus with high thinking via the API in all the agents I have tested/used.
On my compiled language projects I have a stop hook that compiles after every iteration. The agent literally cannot stop working until compilation succeeds.
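For what it's worth, in Claude Code that kind of gate can be wired up as a Stop hook in `.claude/settings.json`. The sketch below assumes a Rust project (`cargo build` is just a stand-in for whatever the project's compile step is) and relies on the hook convention that exit code 2 blocks the stop and feeds the command's stderr back to the agent; check your version's hook schema before copying:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "cargo build --quiet || exit 2"
          }
        ]
      }
    ]
  }
}
```

With this in place, "I'll just commit it broken and file a ticket" is no longer an available move: the session can't end until the build is green.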
I have a different version of the same thing. My pet peeve is that it constantly interprets questions as instructions.
For example, it does a bunch of stuff, and I look at it and say, "Did we already decide to do [different approach]?" And then it runs around and says, "Oh yeah," and then it does a thousand more steps, undoes what it just did, and gets itself into a tangle.
Meanwhile, I asked it a question. The proper response would be to answer the question. I just want to know the answer.
I had it write that behavior into a core memory, and it seems to have improved, for what it's worth.
I've had the agent tell me, "this looks like it's going to be a very big change. It could take weeks." And then I tell it to go ahead and it finishes in 5 minutes, because in reality it just needs grep and sed.
I'm skeptical of most "harness hacking", but this is a situation that calls for it. You need to establish some higher-level context or constraint for it to work against.
You need to use explicit instructions like "make a TODO list of all call sites and use sub agents to fix them all".
What's your setup?