Comment by jihadjihad

12 hours ago

I wish there were a little more color in the Testing and QA section. While I agree with this:

  > A comprehensive test suite is by far the most effective way to keep those features working.

there is no mention at all of LLMs' tendency to write tautological tests: tests that pass because they are defined to pass. Or tests that are not relevant or useful at all, and are ultimately noise in the codebase, wasting cycles on every CI run. Sometimes, to pass the tests, the model will even hardcode a value in the unit test itself!

IMO this section is a great place to show how we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.

Do you have an example of the tautological tests you're referring to? What comes to mind for me is genuinely logically tautological tests, like "assert(true || expectedResult == actualResult)", which is a mistake I don't even expect modern AI coding tools to make. But I suspect you're talking about a subtler type of test that at first glance appears useful but actually isn't.
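To sketch the subtler kind (the class, values, and method names here are all invented for illustration): a test whose expected value is recomputed with the same formula the implementation uses, so a bug changes both sides at once and the test can never fail:

```ruby
# Hypothetical class under test (names invented for illustration).
class Price
  def initialize(cents)
    @cents = cents
  end

  # Applies a 10% discount, rounding down to whole cents.
  def discounted
    (@cents * 0.9).floor
  end
end

# Tautological: the expectation recomputes the answer with the exact
# same formula as the implementation, so any bug in the formula
# changes both sides at once and the test can never fail.
def tautological_test
  Price.new(999).discounted == (999 * 0.9).floor
end

# Meaningful: the expected value is stated independently, so a change
# to the discount rate or the rounding would actually be caught.
def meaningful_test
  Price.new(999).discounted == 899
end
```

Both pass today, but only the second one fails if someone changes `floor` to `round` or the rate to 0.95.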

  • I've definitely seen Opus go to town when asked to test a fairly simple builder. Possibly it inferred something about testing the "contract", and went on to test such properties as

      - none of the "final" fields have changed after calling each method
      - these two immutable objects we just confirmed differ on a property are not the same object
    

    On top of that: multiple tests with essentially identical code, multiple test classes with largely duplicated tests, etc.

  • Among many other possible examples, here are a few [0] from Ruby that I've seen in the wild before LLMs, and still see today spat out by LLMs.

    0: https://www.codewithjason.com/examples-pointless-rspec-tests...

    • I do see agents pop out tests that look like this occasionally:

        it { expect(classroom).to have_many(:students) }
      

      If I catch them, I tell them not to and they remove the tests, but a few do end up slipping through.

      I'm not sure they're particularly harmful any more, though. It used to be that they added extra weight to your test suite, meaning that when you made changes you had to update pointless tests.

      But if the agent is updating the pointless tests for you, I can afford a little bit of unnecessary testing bloat.


  • I don’t have examples but I have an LLM-driven project with like…2500 tests, and I regularly need to prune:

    * no-op tests

    * unit tests labeled as integration tests

    * tests set to skip because they were failing and the agent didn’t want to fix them

    * tests that can never fail

    Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working with spec + chat and don’t have extra testing harnesses.

  • For example, you might write a concurrency test, and the agent will cheerfully remove the concurrency and announce that it passes. They get so hung up on making things work in a narrow sense that they lose track of the purpose.

Yes. And a bad test, one that passes because it's defined to pass, is _much worse_ than no test at all: it makes you think an edge case is "covered" by a meaningful check.

Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to wonder whether it's a bad test...

That's where mutation testing becomes even more valuable. If a test still passes after the code under test has been mutated, you may want to look deeper, because it's a sign the test isn't actually verifying behavior.

This seems like it should be very easy to validate. Force the AI to make a minimal change to the code under test that makes a single test (or as few tests as possible) fail as a result. If it can't make any test fail at all, the test suite is useless.
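That validation loop can be hand-rolled as a crude sketch (a real setup would use a mutation-testing framework such as the mutant gem; `clamp` and its tests are invented for illustration): run the suite, apply a minimal mutation to the code under test, and run the suite again expecting a failure.

```ruby
# Code under test (invented for illustration).
def clamp(value, low, high)
  return low if value < low
  return high if value > high
  value
end

# A tiny suite: each lambda returns true on pass.
TESTS = [
  -> { clamp(5, 0, 10) == 5 },    # in range
  -> { clamp(-3, 0, 10) == 0 },   # below the lower bound
  -> { clamp(42, 0, 10) == 10 },  # above the upper bound
]

def suite_passes?
  TESTS.all?(&:call)
end

before_mutation = suite_passes?   # a sound suite passes here

# Minimal mutation: the lower-bound branch returns the wrong bound.
def clamp(value, low, high)
  return high if value < low      # mutated line
  return high if value > high
  value
end

after_mutation = suite_passes?    # a sound suite FAILS here; if it
                                  # still passes, the mutated branch
                                  # was never really exercised
```

Here the second test kills the mutant (`clamp(-3, 0, 10)` now returns 10 instead of 0); a suite that only checked the in-range case would let it survive.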

  • Agreed, and that's why I think adding some example prompts and ideas to the Testing section would be helpful. A vanilla-prompted LLM, in my experience, is very unreliable at adding tests that fail when the changes are reverted.

    Many times I've observed that the tests added by the model simply pass as part of the changes, but still pass even when those changes are no longer applied.

  • This is essentially the dual of the idea behind mutation testing, and should be trivial to do with a mutation testing framework in place (track whether a given test catches mutants or, more sophisticated, whether it catches exactly the same mutants as some other test).

  • That's part of the reason I like red/green TDD - you make the agent show that the test fails before the implementation and passes afterwards.

    It can still cheat, but it's less likely to cheat.
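A minimal sketch of that red/green discipline (`slugify` and its test are invented for illustration): capture the test's result before the implementation exists, then again after, and insist on seeing both states.

```ruby
# The test under discussion. Rescuing NameError covers the
# NoMethodError raised while slugify does not exist yet.
def test_slugify
  slugify("Hello World") == "hello-world"
rescue NameError
  false
end

red = test_slugify    # no implementation yet, so the test is red

# The implementation arrives only after the red state was observed.
def slugify(title)
  title.downcase.gsub(/\s+/, "-")
end

green = test_slugify  # now the same test passes
```

An agent that skips the red step never demonstrates that the test can distinguish "implemented" from "not implemented", which is exactly the property the tautological tests above lack.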

> we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.

I have a hard enough time getting humans to write tests like this…