Comment by joshuaisaact
2 days ago
I notice one of the things you don't really talk about in the blog post (or if you did, I missed it) is unnecessary tests, which is one of the key problems LLMs have with test writing.
In my experience, if you just ask an LLM to write tests, it'll write you a ton of boilerplate happy path tests that aren't wrong, per se; they're just pointless (one fun one in React is 'the component renders').
How do you plan to handle this?
I've actually thought about this multiple times at this point.
You're right, this deserves more attention, and it's a real problem going forward for this app. I ran into it when I first started building: the LLM either generated XSS tests for every user input validation method (even when it used other validators), or just a single test case.
For now I try to strictly limit the number of tests the LLM generates.
This is done with a "Planner" agent that plans the tests for each function before any generation happens. That agent is instructed to produce a plan that follows these criteria:
- testCases.category MUST be one of "happy_path" | "edge_case" | "error_handling" | "boundary".
It is also asked to generate 2-3 tests per category. This can still produce some unnecessary tests, but it at least caps how many there are (a rough sketch of the plan shape is below).
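For what it's worth, here is a minimal sketch of the kind of constrained plan that criterion implies, written as a zod schema in TypeScript. This is not the app's actual schema: only `testCases.category` and its four values come from the comment above; `functionName`, `description`, and `justification` are invented field names, and the 2-3-per-category rule is approximated here and would still need to be enforced in the prompt or a post-parse check.

```ts
import { z } from "zod";

// The four allowed categories from the planning criteria above.
const TestCategory = z.enum([
  "happy_path",
  "edge_case",
  "error_handling",
  "boundary",
]);

// Hypothetical shape of a single planned test case.
const TestCase = z.object({
  category: TestCategory,
  description: z.string(),   // what the test is supposed to verify
  justification: z.string(), // why this test is worth having at all
});

// One plan per function under test. max(12) is only a rough cap
// (4 categories x 3 cases); the per-category 2-3 limit lives in the
// prompt or in a validation pass after parsing.
const TestPlan = z.object({
  functionName: z.string(),
  testCases: z.array(TestCase).max(12),
});

export type TestPlan = z.infer<typeof TestPlan>;
```

Asking for a justification per test case also gives a later review pass something concrete to challenge when deciding whether a test is worth keeping.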
Going forward I believe the best approach would be to tune and tweak the requirements based on the language/framework it detects.
Do a structured code review, with a few passes by Claude or Codex. Have it provide an annotated justification for each test, and flag tests with redundant, low, or no utility within the context of the rest of the tests. Anything that looks questionable to you, call it out on the next pass, and if it's not justified by the time you fully understand the tests, nuke it.
You could automate this, but you'll end up getting rid of useful tests and keeping weird useless ones until the AI gets better at nuance and large codebases.
What I see a lot is a generated test for something I prompted, and the test passes. Then I manually break it, and it fails for a different reason than the one I wanted to verify.
Guess I need to make it generate negative tests?
The automated version of this is mutation testing.
Which is actually probably a solid idea for this exact use case.
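For anyone unfamiliar: a mutation testing tool (StrykerJS in the JS/TS world, for example) makes small changes ("mutants") to the code under test and re-runs the suite; any mutant the tests still pass against is reported as "survived", which is exactly the "passes, but not for the reason I wanted" signal above. A hand-rolled sketch, with an invented `isValidUsername` function and Vitest assumed as the runner:

```ts
import { test, expect } from "vitest";

// Hypothetical code under test.
export function isValidUsername(name: string): boolean {
  return name.length >= 3 && name.length <= 20;
}

// The kind of happy-path test an LLM tends to generate. It passes.
test("accepts a normal username", () => {
  expect(isValidUsername("alice")).toBe(true);
});

// A mutation tool would generate mutants along the lines of:
//   name.length >= 3   ->   name.length > 3
//   name.length <= 20  ->   name.length < 20
// The test above still passes against both, so they survive: nothing in
// the suite pins down the boundaries. Killing them needs boundary tests:
test("accepts the minimum and maximum lengths", () => {
  expect(isValidUsername("abc")).toBe(true);          // length 3
  expect(isValidUsername("a".repeat(20))).toBe(true); // length 20
});
```

Surviving mutants won't flag every pointless test, but they do catch the weak ones that assert almost nothing about the behaviour they claim to cover.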