← Back to context

Comment by _alternator_

8 days ago

This red vs blue team is a good way to understand the capabilities and current utility of LLMs for expert use. I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them; and if they are correct, they adds value. But often they don’t test the core functionality; the best tests I still have to write myself.

Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).

> I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them

Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.

Having worked on legacy codebases, some of the hardest problems are determining “why is this broken test here that appears to test a behavior we don’t support”. Do we have a bug? Or do we have a bad test? On the other end, when there are tests for scenarios we don’t actually care about it’s impossible to determine if that test is meaningful or was added because “it’s testing the code as written”.

  • I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.

    Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I’ve seen are less “harness” and more “straight-jacket”

    • I think a problem with AI productivity metrics is that a lot of the productivity is made up.

      Most enterprise code involves layers of interfaces. So implementing any feature requires updating 5 layers and mocking + unit testing at each layer.

      When people say “AI helps me generate tests”, I find that this is what they are usually referring to. Generating hundreds of lines of mock and fake data boilerplate in a few minutes, that would otherwise take an entire day to do manually.

      Of course, the AI didn’t make them more productive. The entire point of automated testing is to ensure software correctness without having to test everything manually each time.

      The style of unit testing above is basically pointless. Because it doesn’t actually accomplish the goal. All the unit tests could pass and the only thing you’ve tested is that your canned mock responses and asserts are in-sync in the unit testing file.

      A problem with how LLMs are used is that they help churn through useless bureaucratic BS faster. But the problem is that there’s no ceiling to bureaucracy. I have strong faith that organizations can generate pointless tasks faster than LLMs can automate them away.

      Of course, this isn’t a problem with LLMs themselves, but rather an organization context in which I see them frequently being used.

      1 reply →

    • An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.

      9 replies →

    • I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes it slows it compared to just writing incorrect code lol, but that's not the point.

      14 replies →

  • > Tests are the source of truth more so than your code

    Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:

    > Do we have a bug? Or do we have a bad test?

    cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.

    • > The spec

      The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.

      But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.

      10 replies →

    • None of the four: code, tests, spec, people's memory, are the single source of truth.

      It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).

      Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.

  • > “why is this broken test here that appears to test a behavior we don’t support”

    Because somebody complained when that behavior we don't support was broken, so the bug-that-wasn't-really-a-bug was fixed and a test was created to prevent regression.

    Imho, the mistake was in documentation: the Test should have comments explaining why this test was created.

    Just as true for tests as for the actual business logic code:

    The code can only describe the what and the how. It's up to comments to describe the why.

  • I believe they just meant that tests are easy to generate for eng review and modification before actually committing to the codebase. Nothing else is a dependency on an individual test (if done correctly), so it's comparatively cheap to add or remove compared to production code.

    • Yup. I do read and review the tests generated by LLMs. Often the LLM tests will just be more comprehensive than my initial test, and hit edge cases that I didn’t think of (or which are tedious). For example, I’ll write a happy path test case for an API, and a single “bad path” where all of the inputs are bad. The LLM will often generate a bunch of “bad path” cases where only one field has an error. These are great red team tests, and occasionally catch serious bugs.

  • Ideally the git history provides the “why was this test written”, however if you have one Jira card tied to 500+ AI generated tests, it’s not terribly helpful.

    • >if you have one Jira card tied to 500+ AI generated tests

      The dreaded "Added tests" commit...

  • > Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.

    I hear you on this, but you can still use so long as these tests are not comingled with the tests generated by subject-matter experts. I'd treat them almost a fuzzers.

  • This is the conclusion I'm at too, working on a relatively new codebase. Our rule is that every generated test must be human reviewed, otherwise its an autodelete.

  • What do you think about leaning on fuzz testing and deriving unit tests from bugs found by fuzzing?

    • You end up with a pile of unit tests called things like "regression, don't crash when rhs null" or "regression, terminate on this" which seems fine.

      The "did it change?" genre of characterisation/snapshot tests can be created very effectively using a fuzzer, but should probably be kept separate from the unit tests checking for specific behaviour, and partially regenerated when deliberately changing behaviour.

      Llvm has a bunch of tests generated mechanically from whatever the implementation does and checked in. I do not rate these - they're thousands of lines long, glow red in code review and I'm pretty sure don't get read by anyone in practice - but because they exist more focused tests do not.

  • This is why tests need documenting what exactly they intend to test, and why.

I have the exact opposite idea. I want the tests to be mine and thoroughly understood, so I am the true arbiter and then I can let the LLM go ham on the code without fear. If the tests are AI made, then I get some anxiety letting agents mess with the rest of the codebase

  • I think this is exactly the tradeoff (blue team and red team need to be matched in power), except that I’ve seen LLMs literally cheat the tests (eg “match input: TEST_INPUT then return TEST_OUTPUT”) far too many times to be comfortable with letting LLMs be a major blue team player.

    • Yeah, they may do that, but people really should read the code an LLM produces. Ugh, makes me furious. No wonder LLMs have a bad rep from such users.

      2 replies →

I tried a LLM to generate tests for Rust code. It was more harmful then useful. Surely there were a lot of tests, but they still miss the key coverage and it was hard to see what was missed due to the amount of generated code. Then to change the code behavior in future would require to fix a lot of tests again versus fixing few lines in manually written tests.

There's a saying that since nobody tests the tests, they must be trivially correct.

That's why they came up with the Arrange-Act-Assert pattern.

My favorite kind of unit test nowadays is when you store known input-output pairs and validate the code on them. It's easy to test corner cases and see that the output works as desired.

AI is like a calculatorin this respect. Calculators can do things most humans can't. They make great augmentation devices. AI being a different kind of intelligence is very useful! Everyone is building AI replace human things. But the value is in augmentation.

> prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).

The solution will come from synthetic data training methods that lobotomize part of the weights. It's just cross-validation. A distilled awareness won't maintain knowledge of the cheat paths, exposing them as erroneous.

This may a reason why every living thing on Earth that encounters psychoactive drugs seems to enjoy them. Self-deceptive paths depend on consistency whereas truth-preservation of facts grounded in reality will always be re-derived.

I think the more fundamental attribute of interest is how easy it is to verify the work.

Much red team work is easily verifiable; either the exploit works or it doesn’t. Whereas more blue-team work is not easily verifiable; it might take judgement to figure out if a feature is promising.

LLMs are extremely powerful (and trainable) on tasks with a good oracle.