Comment by alkonaut
12 hours ago
This seems like it should be very easy to validate. Force the AI to make a minimal change to the code under test that makes a single test (or as few tests as possible) fail as a result. If it can't make any test fail at all, the added tests are useless.
Agreed, and that's why I think adding some example prompts and ideas to the Testing section would be helpful. A vanilla-prompted LLM, in my experience, is very unreliable at adding tests that fail when the changes are reverted.
Many times I've observed that tests added by the model pass alongside the changes, but still pass even after those changes are reverted.
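Here's a quick sketch of that failure mode (hypothetical functions, just to illustrate): a tautological test passes whether or not the fix is applied, so it provides no evidence the change works, while a test that actually exercises the change fails on the pre-fix code.

```python
# Hypothetical example of a worthless vs. a useful AI-written test.

def buggy_add(a, b):      # implementation before the "fix"
    return a - b

def fixed_add(a, b):      # implementation after the "fix"
    return a + b

def weak_test(add):
    # Tautological: 0 + 0 == 0 - 0, so this passes for BOTH
    # implementations. Reverting the fix changes nothing.
    return add(0, 0) == 0

def strong_test(add):
    # Fails on the buggy version, passes only after the fix.
    return add(2, 3) == 5

assert weak_test(buggy_add) and weak_test(fixed_add)      # passes either way
assert not strong_test(buggy_add) and strong_test(fixed_add)
```

Running the suite with the change reverted (here, swapping `fixed_add` back to `buggy_add`) is exactly the check a vanilla-prompted LLM tends to skip.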
I had an example in that section but it got picked apart by pedants (who had good points) so I removed it. I plan to add another soon. You can still see it in the changelog: https://simonwillison.net/guides/agentic-engineering-pattern...
Matt Pocock has a nice TDD skill he's made available [0][1].
[0] https://www.aihero.dev/skill-test-driven-development-claude-...
[1] https://github.com/mattpocock/skills/blob/main/tdd/SKILL.md
This is essentially the dual of the idea behind mutation testing, and should be trivial to do with a mutation testing framework in place (track whether a given test catches mutants, or, more sophisticated, whether it catches exactly the same mutants as some other test).
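A minimal sketch of what a mutation testing framework does under the hood (toy code, not a real framework like mutmut): mutate the AST of the function under test (`+` becomes `-`) and record whether a given test kills the mutant. A test that lets the mutant survive is the in-suite equivalent of the reverted-change check above.

```python
import ast

# Toy mutation tester: one mutation operator (Add -> Sub), applied to
# source we compile and exec in a scratch namespace.

SRC = "def add(a, b):\n    return a + b\n"

class FlipAdd(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()          # the mutation: a + b -> a - b
        return node

def mutant(src):
    tree = FlipAdd().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return compile(tree, "<mutant>", "exec")

def kills(test, code):
    ns = {}
    exec(code, ns)
    try:
        test(ns["add"])
        return False                     # mutant survived: test has no teeth
    except AssertionError:
        return True                      # mutant killed: test does real work

def weak_test(add):
    assert add(0, 0) == 0                # 0 + 0 == 0 - 0: can't tell them apart

def strong_test(add):
    assert add(2, 3) == 5

assert not kills(weak_test, mutant(SRC))     # mutant survives the weak test
assert kills(strong_test, mutant(SRC))       # strong test kills it
```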
That's part of the reason I like red/green TDD - you make the agent show that the test fails before the implementation and passes afterwards.
It can still cheat, but it's less likely to cheat.