Comment by boredtofears

12 days ago

Here's an example:

I recently inherited a web project over a decade old, full of EOL'd libraries and OS packages, that desperately needed to be modernized.

Within 3 hours I had a working test suite with 80% code coverage on core business functionality (~300 tests). Now - maybe the tests aren't the best designs given there is no way I could review that many tests in 3 hours, but I know empirically that they cover a majority of the code of the core logic. We can now incrementally upgrade the project and have at least some kind of basic check along the way.

There's no way I could have pieced together as large a working test suite using tech of that era in even double that time.

> maybe the tests aren't the best designs given there is no way I could review that many tests in 3 hours,

If you haven't reviewed and signed off then you have to assume that the stuff is garbage.

This is the crux of using AI to create anything, and it has been a core rule of development for many years: you don't use wizards unless you understand what they are doing.

  • I used a static analysis code coverage tool to guarantee it was exercising the logic, but I did not verify the logic checks myself. The biggest risk is that I have no way of knowing whether I codified actual bugs with tests, but if that's true, those bugs were already there anyway.

    I'd say that for what I'm trying to do - upgrading a very old version of PHP to something that is still supported - this is completely acceptable. These tests are basically acting as smoke tests.

    • > code coverage

      You need to be a bit careful here. A test that runs your function and then asserts something useless like 'typeof response == object' will also meet those code coverage numbers (see the sketch below).

      In reality, modern LLMs write tests that are more meaningful than that, but it's still worth testing the assumption and thinking up your own edge cases.
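
      For example, here's a made-up PHPUnit sketch (hypothetical InvoiceCalculator class, not your project's code). Both tests execute exactly the same lines, so both count identically toward coverage, but only the second one would catch a broken tax calculation:

          <?php
          // Hypothetical sketch: coverage-equal tests with very different value.

          use PHPUnit\Framework\TestCase;

          final class InvoiceCalculator
          {
              /** @param int[] $itemsInCents */
              public function totalInCents(array $itemsInCents, float $taxRate): int
              {
                  $subtotal = array_sum($itemsInCents);
                  return (int) round($subtotal * (1 + $taxRate));
              }
          }

          final class InvoiceCalculatorTest extends TestCase
          {
              public function testTotalSatisfiesCoverageOnly(): void
              {
                  $total = (new InvoiceCalculator())->totalInCents([1000, 250], 0.10);

                  // Runs the code and bumps the coverage number, but proves almost nothing.
                  $this->assertIsInt($total);
              }

              public function testTotalChecksTheActualResult(): void
              {
                  $total = (new InvoiceCalculator())->totalInCents([1000, 250], 0.10);

                  // Same lines covered, but this fails if the tax math ever regresses.
                  $this->assertSame(1375, $total);
              }
          }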

I code firmware for a heavily regulated medical device (where mistakes mean life and death), and I try to have AI write unit tests for me all the time. I would say I spend about 3 days correcting and polishing what the AI gives me in 30 minutes. The first pass the AI gives me likely saves a day of work, but you would have to be crazy to trust it blindly. I guarantee it is not giving you what you think it is or what you need. And writing the tests is when I usually find and fix issues in the code. If the AI is writing tests that all pass without the code being updated, then it's likely telling you, falsely, that the code is perfect when it isn't.

  • If you're using a code coverage tool to identify the branches it's hitting in the code, you at least have a guarantee that it's exercising the code it's writing tests for, as long as you check the assertions. I could be codifying bugs with tests, and probably did (but those bugs were already there anyway). For the purpose of upgrading OS libraries and surrounding software, this is a good approach - I can incrementally upgrade the software, run all the tests, and see if anything falls over.

    I'm not having AI write tests for life-or-death software, nor did I claim that the AI wrote tests that all passed without updating any code.

All you know is that they cause a majority of the core logic to execute, right? Are you sure the tests actually check that those bits of logic are doing the right thing? I've had Claude et al. write me plenty of tests that exercise things and then explicitly swallow errors and pass.
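
Concretely, the "swallow errors and pass" pattern looks something like this (a made-up PHPUnit sketch with a hypothetical OrderProcessor, not anyone's real code). The processor is deliberately broken, coverage still goes up, and the test stays green:

    <?php
    // Hypothetical sketch: a test that can never fail, no matter what the code does.

    use PHPUnit\Framework\TestCase;

    final class OrderProcessor
    {
        public function process(array $order): void
        {
            // Deliberately broken so it's obvious the test below can't notice.
            throw new \RuntimeException('not implemented for ' . $order['sku']);
        }
    }

    final class OrderProcessorTest extends TestCase
    {
        public function testProcessDoesNotBlowUp(): void
        {
            try {
                (new OrderProcessor())->process(['sku' => 'ABC-1', 'qty' => 2]);
                $this->assertTrue(true); // "it ran" is the only thing being checked
            } catch (\Throwable $e) {
                $this->assertTrue(true); // the failure path also passes
            }
        }
    }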

  • Yes, the first hour or so was spent fiddling with test creation. It started out doing its usual wacky behavior: checking the existence of a method and calling that a "pass", creating a mock object that mocked the return result of the very logic it was supposed to be testing, and (my favorite) copying the logic out of the code and pasting it directly into the test. Lots of course correction, but once I had one well-written test that I had fully proofed myself, I just provided that test as an example and it did a pretty good job following those patterns for the remainder. I still sniffed through all the output for LLM wackiness, though. Using a code coverage tool also helps a lot.
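
    For anyone curious, those failure modes look roughly like this (a reconstructed PHPUnit sketch with a hypothetical PriceCalculator class, not the actual project tests):

        <?php
        // Hypothetical sketch of three LLM test anti-patterns worth grepping for.

        use PHPUnit\Framework\TestCase;

        class PriceCalculator
        {
            public function applyDiscount(int $priceInCents, float $rate): int
            {
                return (int) round($priceInCents * (1 - $rate));
            }
        }

        final class PriceCalculatorBadTest extends TestCase
        {
            public function testApplyDiscountExists(): void
            {
                // Anti-pattern 1: "the method exists" counted as a pass.
                // The logic never runs, yet the suite shows another green test.
                $this->assertTrue(method_exists(PriceCalculator::class, 'applyDiscount'));
            }

            public function testAppliesDiscount(): void
            {
                // Anti-pattern 2: mocking the very method under test, then asserting
                // that the mock returns what the mock was just told to return.
                $calc = $this->createMock(PriceCalculator::class);
                $calc->method('applyDiscount')->willReturn(900);

                $this->assertSame(900, $calc->applyDiscount(1000, 0.10));
            }

            public function testDiscountMathCopiedFromTheCode(): void
            {
                // Anti-pattern 3: the expected value is re-derived with the same
                // formula copied from the implementation, so both sides share any bug.
                $expected = (int) round(1000 * (1 - 0.10));
                $this->assertSame($expected, (new PriceCalculator())->applyDiscount(1000, 0.10));
            }
        }

    Grepping the generated suite for method_exists checks, mocks of the class under test, and duplicated formulas catches most of these quickly; coverage numbers alone won't.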

... Yeah, those tests are probably garbage. The models probably covered the 80% that consists of boilerplate and mocked out the important 20% that is the critical business logic. That's how it was in my experience.

For God's sake, that's complete slop.

  • You should read my other comment - I did check that the test was actually checking the logic, so I guess I did do some level of review of it.