Comment by tptacek

1 day ago

Which is why you dictate a series of tests for the LLM to generate, and then it generates way more test coverage than you ordinarily would have. Give it a year, and LLMs will be doing test coverage and property testing in closed-loop configurations. I don't think this is a winnable argument!
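
For concreteness, "property testing" here means asserting invariants over generated inputs rather than a handful of hand-picked cases. A minimal Python sketch using the Hypothesis library; the `normalize_whitespace` function is invented for illustration, not something from the thread:

```python
# Minimal property-testing sketch with the Hypothesis library.
# normalize_whitespace is a made-up function under test.
from hypothesis import given, strategies as st

def normalize_whitespace(s: str) -> str:
    return " ".join(s.split())

@given(st.text())
def test_idempotent(s):
    # Property: normalizing twice gives the same result as normalizing once.
    once = normalize_whitespace(s)
    assert normalize_whitespace(once) == once

@given(st.text())
def test_no_double_spaces(s):
    # Property: the output never contains consecutive spaces.
    assert "  " not in normalize_whitespace(s)
```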

Certainly, most of the "interesting" decisions are likely to stay human! And it may never be reasonable to just take LLM vomit and merge it into `main` without reviewing it carefully. But this idea people have that LLM code is all terrible --- no, it very clearly is not. It's boring, but that's not the same thing as bad; in fact, it's often a good thing.

   Program testing can be used to show the presence of bugs, but never to show their absence!

Edsger Dijkstra, Notes on Structured Programming.

> it generates way more test coverage than you ordinarily would have.

Test coverage is a useless metric. You can cover the code multiple times over and still not test the right values or the right behavior.
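
To make that concrete, here is a small invented Python example: the test executes every line of the function, so a coverage tool reports 100%, yet the bug ships because the assertion checks the wrong value.

```python
# Illustration of 100% line coverage that still misses the bug.
# apply_discount and its test are hypothetical, not code from the thread.

def apply_discount(price: float, percent: float) -> float:
    # Bug: should be price * (1 - percent / 100)
    return price * (1 - percent)

def test_apply_discount():
    # Executes every line of apply_discount, so coverage reports 100%,
    # but it asserts the wrong value: a "1%" discount wiping out the price.
    assert apply_discount(100.0, 1.0) == 0.0
```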

  • You don't do it for bugs, you do it for features in this case.

    Contrived example: You want a program that prints out the weather for the given area.

    First you write the tests (using AI if you want) that test for the output you want.

    Then you tell the AI to implement the code that will pass the tests, and explicitly tell it NOT to fuck with the tests (which Claude 3.7 specifically will happily do; it'll mock things out so aggressively that the tests never touch a line of the actual code under test...)

    With bugs, you always write a test that reproduces the exact case the bug exposed, so it can't reappear. This way you slowly build a robust test suite: 1) find the bug, 2) write a test for the correct behavior, 3) fix the code until the test passes. (There's a rough sketch of this test-first shape after the comments below.)

  • Don't get hung up on the word "coverage". We all know test coverage isn't a great metric.

    I just used IntelliJ AI to generate loads of tests for some old code I couldn't be bothered to finish.

    It wrote tests I wouldn't have written even if I could be bothered. So the "coverage" was certainly better. But more to the point, these were good tests that dealt with some edge cases that were nice to have.
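
For illustration, a rough Python sketch of the test-first workflow from the weather example above. `get_weather_report` and its behavior are invented; in the described workflow the tests are written (or generated) before the implementation exists:

```python
# Rough sketch of the test-first shape described in the weather example.
# All names and behaviors here are hypothetical.
import pytest

def test_report_includes_area_and_temperature():
    # Written first: pins down the output we want, so the implementation
    # (AI-generated or not) has a fixed target and can't redefine success.
    report = get_weather_report("Helsinki")
    assert "Helsinki" in report
    assert "°C" in report

def test_unknown_area_raises():
    # Regression-style test: capture the exact case a bug exposed so it
    # can't quietly reappear later.
    with pytest.raises(ValueError):
        get_weather_report("Atlantis")

# The implementation is only written (or generated) after the tests exist.
def get_weather_report(area: str) -> str:
    known_temps = {"Helsinki": -3, "Lisbon": 17}
    if area not in known_temps:
        raise ValueError(f"unknown area: {area}")
    return f"{area}: {known_temps[area]}°C"
```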