Comment by yojo

8 days ago

I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.

Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I’ve seen are less “harness” and more “straitjacket”.

I think a problem with AI productivity metrics is that a lot of the productivity is made up.

Most enterprise code involves layers of interfaces. So implementing any feature requires updating 5 layers and mocking + unit testing at each layer.

When people say “AI helps me generate tests”, I find this is usually what they’re referring to: generating hundreds of lines of mock and fake-data boilerplate in a few minutes that would otherwise take an entire day to write manually.

Of course, the AI didn’t make them more productive. The entire point of automated testing is to ensure software correctness without having to test everything manually each time.

The style of unit testing above is basically pointless, because it doesn’t actually accomplish that goal. All the unit tests could pass, and the only thing you’ve verified is that your canned mock responses and your asserts are in sync within the test file.
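
To make that concrete, here’s a rough Go sketch of the pattern (all names are hypothetical): the layer under test is a pure pass-through, so the only thing the assertion can confirm is that the fake’s canned value and the expected value were typed in sync.

```go
package example

import "testing"

// userRepo is one of the interface layers; userService merely delegates to it.
type userRepo interface {
    Get(id string) (string, error)
}

type userService struct{ repo userRepo }

func (s userService) DisplayName(id string) (string, error) {
    return s.repo.Get(id) // no logic of its own
}

// fakeRepo is the generated mock: it returns whatever it was constructed with.
type fakeRepo struct{ name string }

func (f fakeRepo) Get(id string) (string, error) { return f.name, nil }

func TestDisplayName(t *testing.T) {
    svc := userService{repo: fakeRepo{name: "Alice"}}

    got, err := svc.DisplayName("user-1")
    if err != nil {
        t.Fatal(err)
    }
    // Passes by construction: the expectation simply mirrors the canned fake response.
    if got != "Alice" {
        t.Errorf("DisplayName() = %q, want %q", got, "Alice")
    }
}
```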

A problem with how LLMs are used is that they help churn through useless bureaucratic BS faster. But there’s no ceiling to bureaucracy: I have strong faith that organizations can generate pointless tasks faster than LLMs can automate them away.

Of course, this isn’t a problem with LLMs themselves, but rather with the organizational context in which I frequently see them being used.

  • I think it's appropriate to be skeptical of new tools and to point out failure modes in an appropriate, respectful, prosocial way. Kudos.

    Something that crosses my mind is whether AI-generated tests are necessarily tests full of fakes and stubs that exercise no actual logic, what expertise is required to notice that, and whether it is correctable.

    Yesterday, I was working on some OAuth flow stuff. Without replayed responses, I'm not quite sure how I'd test it without writing my own server, and I'm not sure how I'd develop the expertise to do that without, effectively, just returning the responses I expected.

    It reminds me that if I eschewed tests with fakes and stubs as untrustworthy in toto, I'd be throwing the baby out with the bathwater.

An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.

  • Yup. Working on a 10-year-old codebase, I always wondered whether a failing test meant "a long-standing bug was accidentally fixed" or "this behavior was added on purpose and customers rely on it". It can be about 50/50, but you're always surprised.

    Change detector tests add to the noise here. No, this wasn't a feature customers care about; some AI added a test to make sure foo.go line 42 contained fewer than 80 characters.

    • I like calling out behavioral vs. normative tests. The difference is mostly optics, but the mere fact that somebody took the time to add a line of comment to ten or a hundred lines of mostly boilerplate tests is usually more than enough to raise an eyebrow, and I honestly don’t need more than a pinch of surprise to make the developer pause.

    • > a long-standing bug was accidentally fixed

      In some cases (e.g., in our case) long-standing bugs become part of the API that customers rely on.

      5 replies →

  • These sorts of tests are invaluable for things like ensuring adherence to specifications such as OAuth2 flows. A high-level test that literally describes each step of a flow will swiftly catch odd changes in behavior such as a request firing twice in a row or a well-defined payload becoming malformed. Say a token validator starts misbehaving and causes a refresh to occur with each request (thus introducing latency and making the IdP angry). That change in behavior would be invisible to users, but a test that verified each step in an expected order would catch it right away, and should require little maintenance unless the spec itself changes.
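
    To sketch the idea in Go (illustrative only, not tied to any particular OAuth2 library; refreshThenFetch is a made-up stand-in for the client under test): fake the IdP with httptest, record which endpoints get hit and in what order, and assert that the sequence matches the steps the spec prescribes. A duplicated refresh shows up immediately as an extra /token entry.

    ```go
    package example

    import (
        "net/http"
        "net/http/httptest"
        "sync"
        "testing"
    )

    // refreshThenFetch is a made-up stand-in for the client code under test:
    // it refreshes a token, then calls /userinfo, once each.
    func refreshThenFetch(base string) {
        if resp, err := http.Post(base+"/token", "application/x-www-form-urlencoded", nil); err == nil {
            resp.Body.Close()
        }
        if resp, err := http.Get(base + "/userinfo"); err == nil {
            resp.Body.Close()
        }
    }

    func TestRefreshFlowStepOrder(t *testing.T) {
        var mu sync.Mutex
        var steps []string

        // Fake IdP/API: record every endpoint hit, in order.
        idp := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            mu.Lock()
            steps = append(steps, r.URL.Path)
            mu.Unlock()
            w.Header().Set("Content-Type", "application/json")
            w.Write([]byte(`{"access_token":"x","expires_in":3600}`))
        }))
        defer idp.Close()

        refreshThenFetch(idp.URL)

        // The flow the spec prescribes: exactly one refresh, then the real call.
        want := []string{"/token", "/userinfo"}
        mu.Lock()
        defer mu.Unlock()
        if len(steps) != len(want) {
            t.Fatalf("flow made calls %v, want %v", steps, want)
        }
        for i := range want {
            if steps[i] != want[i] {
                t.Errorf("step %d: got %s, want %s", i, steps[i], want[i])
            }
        }
    }
    ```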

I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes, it's slower than just writing incorrect code, lol, but that's not the point.

  • "Brittle" here means either:

    1) your test is specific to the implementation at the time of writing, not the business logic you mean to enforce.

    2) your test has non-deterministic behavior (more common in end-to-end tests) that causes it to fail some small percentage of the time on repeated runs.

    At the extreme, these types of tests degenerate your suite into a "change detector," where any modification to the code-base is guaranteed to make one or more tests fail.

    They slow you down because every code change also requires an equal or larger investment in debugging the test suite, even if nothing actually "broke" from a functional perspective.

    Using LLMs to litter your code-base with low-quality tests will not end well.
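
    To illustrate type 1 concretely, here is a hypothetical Go sketch: the second assertion pins how the collaborator is called rather than what the caller returns, so a later refactor (caching, deduplication, batching) breaks the test even though observable behavior is unchanged.

    ```go
    package example

    import "testing"

    type pricer interface{ Price(sku string) int }

    // countingPricer is the test double; it also records how often it was called,
    // which is where the trouble starts.
    type countingPricer struct{ calls int }

    func (p *countingPricer) Price(sku string) int { p.calls++; return 100 }

    // total is the code under test: it sums the price of each item.
    func total(p pricer, skus []string) int {
        sum := 0
        for _, s := range skus {
            sum += p.Price(s)
        }
        return sum
    }

    func TestTotal(t *testing.T) {
        p := &countingPricer{}
        got := total(p, []string{"a", "a", "b"})

        if got != 300 { // behavioral assertion: survives refactors
            t.Errorf("total = %d, want 300", got)
        }
        if p.calls != 3 { // implementation assertion: fails if total() learns to cache or dedupe
            t.Errorf("Price called %d times, want 3", p.calls)
        }
    }
    ```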

  • The problem is that sometimes it is not a necessary condition. Rather, the tests might have been checking implementation details or have simply been wrong in the first place. Now, when a test fails I have extra work to figure out whether it's a real break or just a bad test.

  • The goal of tests is not to prevent you from changing the behavior of your application. The goal is to preserve important behaviors.

    If you can't tell if a test is there to preserve existing happenstance behavior, or if it's there to preserve an important behavior, you're slowed way down. Every red test when you add a new feature is a blocker. If the tests are red because you broke something important, great. You saved weeks! If the tests are red because the test was testing something that doesn't matter, not so great. Your afternoon was wasted on a distraction. You can't know in advance whether something is a distraction, so this type of test is a real productivity landmine.

    Here's a concrete, if contrived, example. You have a test that starts your app in a local webserver and requests /foo, expecting to get the contents of /foo/index.html. One day, you upgrade your web framework, and it has decided to return a 302 redirect to /foo/index.html, so that URLs are always canonical now. Your test fails with "incorrect status code; got 302, want 200". So now what? Do you not apply the version upgrade? Do you rewrite the test to check for a 302 instead of a 200? Do you adjust the test HTTP client to follow redirects silently? The problem here is that you checked for something you didn't care about, the HTTP status, instead of only checking for what you cared about: that "GET /foo" gets you the text you're looking for. In a world where you had let the HTTP client follow redirects, as human-piloted HTTP clients do, and only checked for what you cared about, you wouldn't have had to debug this to apply the web framework security update. But since you tightened the screws constraining your application as far as they would go, you're here debugging this instead of doing something fun.

    (The fun doubles when you have to run every test for every commit before merging, and this one failure happened 45 minutes in. Goodbye, the rest of your day!)
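
    To make the contrast concrete, here's a rough sketch using Go's httptest (names and page content are invented): the test lets the HTTP client follow the redirect, as Go's default client does, and asserts only on the content it cares about; the brittle variant is noted in the final comment.

    ```go
    package example

    import (
        "io"
        "net/http"
        "net/http/httptest"
        "strings"
        "testing"
    )

    // newApp stands in for the application after the framework upgrade:
    // /foo now redirects to the canonical /foo/index.html.
    func newApp() http.Handler {
        mux := http.NewServeMux()
        mux.HandleFunc("/foo", func(w http.ResponseWriter, r *http.Request) {
            http.Redirect(w, r, "/foo/index.html", http.StatusFound)
        })
        mux.HandleFunc("/foo/index.html", func(w http.ResponseWriter, r *http.Request) {
            io.WriteString(w, "<h1>Foo docs</h1>")
        })
        return mux
    }

    func TestFooPage(t *testing.T) {
        srv := httptest.NewServer(newApp())
        defer srv.Close()

        resp, err := http.Get(srv.URL + "/foo") // the default client follows the 302
        if err != nil {
            t.Fatal(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            t.Fatal(err)
        }
        // Assert only what we care about: GET /foo gets us the content.
        if !strings.Contains(string(body), "Foo docs") {
            t.Errorf("GET /foo: expected page content, got %q", body)
        }
        // The brittle version instead pinned the mechanics, e.g. asserting
        // resp.StatusCode == 200 with redirects disabled, and it breaks on the
        // upgrade even though users see no difference.
    }
    ```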

  • It's just that hard to write specs that truly match the business, which is why test-driven development and specification-first failed to take off as movements.

    Asking specs to truly match the business before we begin using them as tests would handcuff the test people in the same way we’re saying tests have the potential to handcuff the app and business-logic people, as opposed to empowering them. So I wouldn’t blame people for writing specs that only match the code implementation at that time. It’s hard to engage in prophecy.

    • The problem with TDD is that people assumed it was writing a specification, or tried to map it directly to post-hoc testing and metrics.

      TDD at its core is defining expected inputs and mapping those to expected outputs at the unit-of-work level, e.g. a function or a class.

      While UAT and the domain inform what those inputs and outputs are, avoiding the temptation to write a broader spec than that is what many people struggle with when learning TDD.

      Avoiding behavior or acceptance tests and focusing on tests at the unit of implementation is the whole point.

      But it is challenging for many to get that to click. It should help you find ambiguous requirements, not develop a spec.
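
      A minimal Go sketch of that framing (names invented): expected inputs mapped to expected outputs for a single unit, table-driven, with no broader spec attached. Filling in the table is also where an ambiguous requirement surfaces.

      ```go
      package example

      import "testing"

      // slugify is the unit of work: lowercase a title and hyphenate spaces.
      func slugify(title string) string {
          out := make([]rune, 0, len(title))
          for _, r := range title {
              switch {
              case r >= 'a' && r <= 'z', r >= '0' && r <= '9':
                  out = append(out, r)
              case r >= 'A' && r <= 'Z':
                  out = append(out, r+('a'-'A'))
              case r == ' ', r == '-':
                  out = append(out, '-')
              }
          }
          return string(out)
      }

      func TestSlugify(t *testing.T) {
          cases := []struct{ in, want string }{
              {"Hello World", "hello-world"},
              {"Already-lower", "already-lower"},
              // Filling in this row forces a question the requirements never answered:
              // should "2.0" become "20", "2-0", or stay "2.0"?
              {"Release 2.0", "release-20"},
          }
          for _, c := range cases {
              if got := slugify(c.in); got != c.want {
                  t.Errorf("slugify(%q) = %q, want %q", c.in, got, c.want)
              }
          }
      }
      ```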

      5 replies →

    • > So I wouldn't blame people for writing specs that only match the code implementation at that time.

      WTF are you doing writing specs based on the implementation? If you already have the implementation, what are you using the specs for? Or, to apply this directly to tests: if you are already assuming the program is correct, what are you trying to test?

      Are you talking about rewriting applications?

      2 replies →