Comment by KurSix

There's a catch with 100% coverage. If the agent writes both the code and the tests, we risk falling into a tautology trap: the agent can write flawed logic and a test that verifies that flawed logic (which will pass). 100% coverage only makes sense if the tests are written before the code or rigorously verified by a human. Otherwise, we're just creating an illusion of reliability by covering hallucinations with tests. An "executable example" is only useful if it's semantically correct, not just syntactically.
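
To make the tautology trap concrete, here's a toy Python sketch (the function and numbers are made up): the expected value in the test was read off the flawed implementation rather than the spec, so it passes with full coverage.

    def apply_discount(price: float, percent: float) -> float:
        # Intended: reduce the price by `percent`.
        # Flawed logic: increases it instead.
        return price * (1 + percent / 100)

    def test_apply_discount():
        # The expected value encodes the same bug as the
        # implementation, so the test happily passes.
        assert apply_discount(100.0, 50.0) == 150.0

Line coverage is 100%, every assertion is green, and the spec is still violated.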

All the problems you list are real, but the solutions, not so much.

I've seen this problem with humans too, as far back as university, when the flawed example was the lecturer's own, in a lecture meant to illustrate the value of formal methods and verification.

I would say the solution is neither "get humans to do it" nor "do it before writing the code", but rather "get multiple different minds involved to check each other's blind spots". And no matter how many AI models you throw at it, they only count as one mind, even when they come from different providers. Human tests and AI code, AI tests and human code, humans doing code reviews of AI code or vice versa: all good. Two different humans usually have different blind spots. Though even then, I've seen humans bully their way into being the only voice in the room with the full support of their boss; not that AI would help with that.

That’s why you’ve gotta test your tests. Insert bugs and ensure they fail.

As the sibling comments alluded to, it’s not exclusively an AI problem since multiple people can miss the issue too.

It’s wonderful that AI is an impetus for so many people to finally learn proper engineering principles though!
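
For the record, "insert bugs and ensure they fail" is exactly what mutation testing automates (tools like mutmut for Python or PIT for the JVM generate the mutants for you). A hand-rolled sketch of the idea, with hypothetical names:

    def clamp(x: int, lo: int, hi: int) -> int:
        return max(lo, min(x, hi))

    def clamp_mutant(x: int, lo: int, hi: int) -> int:
        # Deliberately inserted bug: upper bound is off by one.
        return max(lo, min(x, hi - 1))

    def suite_passes(impl) -> bool:
        # Run the same "test suite" against any implementation.
        try:
            assert impl(5, 0, 10) == 5
            assert impl(-3, 0, 10) == 0
            assert impl(42, 0, 10) == 10  # this assertion kills the mutant
            return True
        except AssertionError:
            return False

    assert suite_passes(clamp)             # the real code passes
    assert not suite_passes(clamp_mutant)  # a surviving mutant means weak tests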

  • but who will test the tests of tests?

    • Double entry is the most effective technique for ensuring accuracy (the best tradeoff of time vs. accuracy) in finance and software. Triple-entry bookkeeping, for instance, never became the standard because it's a bad tradeoff: a large increase in time and effort for a rare increase in accuracy.
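
      The software analogue of double entry is keeping a second, independent set of books: compute the same answer two different ways and insist they agree. A minimal Python sketch (the functions are hypothetical):

        import random

        def running_max_fast(xs):
            # "Production" implementation: single pass.
            out, best = [], float("-inf")
            for x in xs:
                best = x if x > best else best
                out.append(best)
            return out

        def running_max_naive(xs):
            # Independent, obviously-correct second set of books.
            return [max(xs[:i + 1]) for i in range(len(xs))]

        for _ in range(1000):
            xs = [random.randint(-100, 100) for _ in range(random.randint(1, 50))]
            assert running_max_fast(xs) == running_max_naive(xs)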

I think the phase change hypothesis* is a bit wrong.

I think it happens not at 100% coverage but at, say, 100% MC/DC test coverage. This is what SQLite and avionics software aim for.

*has not been confirmed by peer-reviewed research.

  • What's MC/DC?

    • Modified Condition/Decision Coverage (MC/DC) is a test coverage approach that considers a chunk of code covered if:

      - Every branch was "visited". Branch coverage already ensures that (and I would actually advocate for 100% branch coverage before 100% line coverage).

      - Every condition within a decision has taken all possible values. If you have if (enabled && limit > 0), MC/DC requires tests where enabled is true, enabled is false, limit > 0, and limit <= 0.

      - Every condition was shown to independently affect the outcome. (false && limit > 0) would not pass this: a change to limit can never affect the outcome, since the decision is always false. But @zweifuss has a better example.

      - And, of course, every possible decision (the outcome of the entire enabled && limit > 0) needs to be tested. This is what ensures that every branch is taken: both arms of if statements, and exhaustive cases for switch statements, etc.

      MC/DC is usually required for safety-critical code by NASA and ESA, as well as by automotive (ISO 26262) and industrial (IEC 61508) standards.
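
      For concreteness, here is a minimal MC/DC test set for the enabled && limit > 0 decision above, sketched in Python with hypothetical names. Note that n + 1 tests suffice for n conditions, versus the 2^n needed for exhaustive multiple-condition coverage:

        def should_process(enabled: bool, limit: int) -> bool:
            return enabled and limit > 0

        # (enabled, limit, expected decision)
        mcdc_cases = [
            (True,  5, True),   # baseline: both conditions true
            (False, 5, False),  # only `enabled` flipped -> decision flips
            (True,  0, False),  # only `limit > 0` flipped -> decision flips
        ]

        for enabled, limit, expected in mcdc_cases:
            assert should_process(enabled, limit) == expected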

    • Modified Condition/Decision Coverage

      It's mandated by DO-178C for the highest-level (Level A) avionics software.

      Example: if (A && B || C) { ... } else { ... } needs tests showing that each of A, B, and C independently flips the outcome.

      Test  A      B      C      A && B || C  Branch taken  Shows independence for
      1     True   True   False  True         if            (baseline true)
      2     False  True   False  False        else          A (A flips the outcome vs. #1)
      3     True   False  False  False        else          B (B flips the outcome vs. #1)
      4     False  True   True   True         if            C (C flips the outcome vs. #2)

      (C is held false in tests 1-3 so that A && B drives the decision; test 4 flips only C to show its independence.)

    • Basically branch coverage, but also covering variations of the individual predicates, e.g. testing both true || true and true || false

You’re right. What I like doing in those cases is reviewing the tests and the assertions very closely. Frequently that's even faster than looking at the SUT (system under test) itself.

  • I've heard this “review very closely” thing many times, and it rarely means reviewing very closely. Maybe 5% of developers ever really do it, and I'm probably overestimating. When people post AI-generated code here, it's quite obvious that they don't review it properly. There are videos where people recorded how we're supposed to use LLMs, and they clearly don't do it either.

    • Yeah. This is me. I try, but I always miss something. The sheer volume and occasional stupidity make it difficult. Spot-checking only gets you so far. Often the code is excellent except in one or two truly awful spots where it does something crazy.

Well, we let humans write both business logic code and tests often enough, too.

Btw, you can get a lot further with your tests if you move away from examples and towards properties.
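
To illustrate, here's a property-based sketch using the Hypothesis library (the function under test is hypothetical): instead of asserting one hand-picked example, you state invariants and let the framework generate hundreds of inputs.

    from hypothesis import given, strategies as st

    def dedupe_keep_order(xs):
        seen, out = set(), []
        for x in xs:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    @given(st.lists(st.integers()))
    def test_dedupe_properties(xs):
        out = dedupe_keep_order(xs)
        assert len(out) == len(set(xs))       # no duplicates remain
        assert set(out) == set(xs)            # nothing invented, nothing lost
        assert dedupe_keep_order(out) == out  # idempotent

Flawed logic and a flawed example test can agree with each other; it's much harder for flawed logic to satisfy a genuine invariant across thousands of generated inputs.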