← Back to context

Comment by tshaddox

13 hours ago

Do you have an example of the tautological tests you're referring to? What comes to mind to me is genuinely logically tautological tests, like "assert(true || expectedResult == actualResult)" which is a mistake I don't even expect modern AI coding tools to make. But I suspect you're talking about a subtler type of test which at first glance appears useful but actually isn't.

I've definitely seen Opus go to town when asked to test a fairly simple builder. Possibly it inferred something about testing the "contract", and went on to test such properties as

  - none of the "final" fields have changed after calling each method
  - these two immutable objects we just confirmed differ on a property are not the same object

In addition to multiple tests with essentially identical code, multiple test classes with largely duplicated tests etc.

Among many other possible examples, here are a few [0] from Ruby that I've seen in the wild before LLMs, and still see today spat out by LLMs.

0: https://www.codewithjason.com/examples-pointless-rspec-tests...

  • I do see agents pop out tests that look like this occasionally:

      it { expect(classroom).to have_many(:students) }
    

    If I catch them I tell them not to and they remove it again, but a few do end up slipping through.

    I'm not sure that they're particularly harmful any more though. It used to be that they added extra weight to your test suite, meaning when you make changes you have to update pointless tests.

    But if the agent is updating the pointless tests for you I can afford a little bit of unnecessary testing bloat.

    • I don’t love tests like that either, but I’ve seen a lot of them (long before the generative AI era) and heard reasonable people make arguments in favor of them.

      Admittedly, in the absence of halfway competent static type checking, it does seem like a good way to prevent what would be a very bad regression. It doesn’t seem worse than tests which check that a certain property is non-null (when that’s a vital business requirement and you’re using a language without a competent type system).

I don’t have examples but I have an LLM driven project with like…2500 tests and I regularly need to prune:

* no-op tests

* unit tests labeled as integration tests

* skipped tests set to skip because they were failing and the agent didn’t want to fix them

* tests that can never fail

Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working w spec + chat and don’t have extra testing harnesses.

For example, you might write a concurrency test, and the agent will cheerfully remove the concurrency and announce that it passes. They get so hung up on making things work in a narrow sense that they lose track of the purpose.