Comment by PaulDavisThe1st

6 hours ago

That's not how software development works.

Folks think, they write code, they do their own localized evaluation and testing, then they commit and then the rest of the (down|up)stream process begins.

LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.

They absolutely can do that if you give them the tools. Seeing Claude (I use it with opencode agents) run curl and playwright to verify and then fix its implementation was a real 'wow' moment for me.
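
For what it's worth, here is a rough sketch of the kind of self-check loop I mean; the localhost URL, the /api/health endpoint, and the #status selector are made-up stand-ins for illustration, not Claude's or opencode's actual tooling:

```ts
// Sketch of an agent-style self-check: hit the API the way curl would,
// then confirm the rendered page with Playwright.
// Everything about the app under test (URL, endpoint, selector) is hypothetical.
import { chromium } from "playwright";

async function verify(): Promise<void> {
  // API-level check (the curl equivalent)
  const res = await fetch("http://localhost:3000/api/health");
  if (!res.ok) throw new Error(`health check failed: ${res.status}`);

  // UI-level check via Playwright
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto("http://localhost:3000");
    const status = await page.textContent("#status"); // hypothetical selector
    if (status?.trim() !== "OK") throw new Error(`unexpected status: ${status}`);
  } finally {
    await browser.close();
  }
}

verify().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit is what lets the agent notice the failure and retry
});
```

The point isn't the specific checks; it's that the agent gets a pass/fail signal it can act on before handing the change back to you.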

  • We have different experiences. Often I’ll see Claude et al. find creative ways to fulfill the task without satisfying my intent, e.g., changing the implementation plan I specifically asked for, changing tolerances or even tests, and frequently disabling tests.

    • I see these “you had a different experience than me” comments around AI coding agents a lot and can concur; I’ll have a different experience with Copilot even from day to day. Some days it’s great, and other days it’s so bad I give up on using it at all.

      Makes me honestly wonder: will AGI just give us agents that get into bad moods and don’t want to work for the day because they’re tired or just don’t feel like it?

> LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step.

I'm not sure where this idea comes from. Just instruct it to write and run unit tests and document as it goes. All of the ones I've used will happily do so.

You still have to verify that the unit tests are valid, but that's still far less work than skipping them or writing the code/tests yourself.
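
As a concrete (and entirely hypothetical) illustration of what "verify that the unit tests are valid" means, here's a toy parseDuration helper with one agent-style test that merely looks plausible and one that actually pins down the intent:

```ts
// Hypothetical example: the helper and both tests are invented for illustration.
import { test } from "node:test";
import assert from "node:assert/strict";

// Parse strings like "1h30m" into seconds.
function parseDuration(s: string): number {
  const m = s.match(/^(?:(\d+)h)?(?:(\d+)m)?$/);
  if (!m) throw new Error(`bad duration: ${s}`);
  return Number(m[1] ?? 0) * 3600 + Number(m[2] ?? 0) * 60;
}

// A test that passes no matter what the arithmetic does: it only checks the type.
test("parseDuration returns a number", () => {
  assert.equal(typeof parseDuration("1h30m"), "number");
});

// The test you actually want: it pins the behavior to the stated intent.
test("parseDuration converts hours and minutes to seconds", () => {
  assert.equal(parseDuration("1h30m"), 5400);
});
```

Both tests pass, but only the second would catch a broken conversion, and that's the judgment call a human still has to make.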

  • I disagree that it's less work. It just rewrites tests carte blanche. I've seen it rewrite and rewrite tests to the point of undermining the original test intention. So now, instead of intentionally writing code and a new unit test, I need to intentionally go and review EVERY unit test it touched. Every. Time.

    It also doesn't necessarily update documentation as the implementation changes. I've seen documentation rot happen within the same coding session.

> actually verify that the code I just wrote does what I intended it to

That's what the author did when they ran it.

Claude Opus 4.5 will routinely test its own code before handing it off to you, even with zero instruction to do so.

  • ProTools (a DAW), one commercial equivalent of the project I work on, has a test "harness" that took 6 people more than a year to write and takes more than a week to execute.

    Last month, I made a minor change to our own code and verified that it worked (it did!). Earlier this week, I was notified of an entirely different workflow that had been broken by the change I had made. The only sort of automated testing that would have detected this would have been similar in scope and scale to the ProTools test harness, and neither an individual human nor an LLM is going to run that.

    Moreover, that workflow was entirely graphically based, so unless Claude Opus 4.5 (or whatever today's flavor of vibe-coding LLM agent is) has access to a testing system that allows it to inject mouse events into a running instance of our application (hint: it does not), there's no way it could run an effective test for this sort of code change.

    I have no doubt that Claude et al. can verify that their carefully defined module does the very limited task it is supposed to do, for cases where "carefully defined" and "very limited" are appropriate. If that's the only sort of coding you do, I am sorry for your loss.

    • > access to a testing system that allows it to inject mouse events into a running instance of our application

      FWIW that's precisely what https://pptr.dev is all about. To your broader point, though, designing a good harness itself remains very challenging and requires actually understanding the value to the user, the software architecture (to e.g. bypass user interaction and test the API first), etc.
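
      To make that concrete, here's a minimal Puppeteer sketch of injecting mouse events into a page; the URL and coordinates are made up, and as noted above this only reaches web UIs, not a native desktop DAW:

      ```ts
      // Minimal Puppeteer sketch: drive a page with synthetic mouse input.
      // The app under test (URL, coordinates) is hypothetical; this only works for web UIs.
      import puppeteer from "puppeteer";

      (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto("http://localhost:3000"); // hypothetical app under test
        await page.mouse.move(200, 150);          // made-up coordinates
        await page.mouse.down();
        await page.mouse.move(400, 150);          // e.g. drag a fader or region
        await page.mouse.up();
        await page.screenshot({ path: "after-drag.png" }); // an artifact a harness could diff
        await browser.close();
      })();
      ```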