Comment by mixedbit
1 year ago
I have experimented with using an LLM to improve the unit test coverage of a project. If you provide the model with test execution results and updated coverage information, which can be automated, it can indeed fix bugs in and improve the tests it created. I found it had a high success rate at producing working unit tests with good coverage. I just used Docker to isolate the LLM-generated code from the rest of my system.
You can find more details about this experiment in a blog post: https://mixedbit.org/blog/2024/12/16/improving_unit_test_cov...
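For a concrete idea of the automation, here's a minimal sketch of the loop, assuming a Python project with pytest and pytest-cov (the `ask_llm_to_revise_tests` hook and the `mypackage` name are placeholders, not a real API):

```python
# Sketch of the feedback loop: run the LLM-generated tests inside a throwaway
# Docker container, collect results plus a missing-lines coverage report, and
# feed both back to the model so it can revise the tests.
import subprocess

def run_tests_in_docker(repo_dir: str):
    """Return (exit code, combined output) of a pytest + coverage run in a container."""
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{repo_dir}:/app", "-w", "/app",
         "python:3.12", "sh", "-c",
         "pip install -q pytest pytest-cov && "
         "pytest --cov=mypackage --cov-report=term-missing"],  # 'mypackage' is a placeholder
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout + result.stderr

def improve_tests(repo_dir: str, ask_llm_to_revise_tests, max_rounds: int = 5) -> None:
    for _ in range(max_rounds):
        code, report = run_tests_in_docker(repo_dir)
        if code == 0:
            break  # suite is green; the coverage report still shows what remains uncovered
        # The model sees the failures and the uncovered lines, then rewrites the tests.
        ask_llm_to_revise_tests(report)
```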
It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.
I've found something similar: when you keep feeding the LLM what the compiler says, it keeps adding more and more complexity to try to fix the error, and either it works by chance (leaving you with wildly overengineered code) or it just never works.
I've very rarely seen it simplify things to get the code to work.
I have the same observation: LLMs seem heavily biased toward adding complexity to solve problems, for example adding explicit handling for the edge cases I pointed out rather than reworking the algorithm to eliminate the edge cases altogether. Almost every time it starts with something that's 80% correct, then iterates into something that's 90% correct while being extremely complex, unmaintainable, and with no chance of ever covering the last 10%.
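To illustrate the pattern with a toy example (mine, not from any particular session): the first version below is the kind of edge-case-by-edge-case patching I mean, and the second is the rework that makes the edge cases disappear.

```python
# Toy example of the bias: the first version is what the iterations tend to
# converge on (edge cases patched in one by one); the second reworks the
# algorithm so the special cases are no longer needed.

def chunks_patched(xs, n):
    if not xs:
        return []                     # patch: empty input
    if n <= 0:
        raise ValueError("n must be positive")
    if len(xs) <= n:
        return [xs]                   # patch: input shorter than one chunk
    out, i = [], 0
    while i < len(xs):
        if i + n > len(xs):
            out.append(xs[i:])        # patch: short final chunk
        else:
            out.append(xs[i:i + n])
        i += n
    return out

def chunks(xs, n):
    if n <= 0:
        raise ValueError("n must be positive")
    # Slicing already handles the empty input and the short final chunk.
    return [xs[i:i + n] for i in range(0, len(xs), n)]
```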
Hmm, I worked with students in an “intro to programming” type course for a couple years. As far as I’m concerned, “I added complexity until it compiled and now it works but I don’t understand it” is pretty close to passing the Turing test, hahaha.
Suggestion: Now take the code away, and have the chatbot generate code that passes the tests it wrote.
(In theory, you get a clean-room implementation of the original code. If you do this please ping me because I'd love to see the results.)
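Something like this, roughly, assuming a Python target and pytest; `generate_implementation` stands in for whatever model call you'd use:

```python
# Hide the original implementation, hand the model only the test file, and let it
# iterate against the test results until the suite passes (or we give up).
import subprocess
from pathlib import Path

def regenerate_from_tests(test_file: Path, impl_file: Path, generate_implementation,
                          max_rounds: int = 10) -> bool:
    feedback = ""  # no test output yet on the first attempt
    for _ in range(max_rounds):
        # The model never sees the original code, only the tests and prior failures.
        impl_file.write_text(generate_implementation(test_file.read_text(), feedback))
        result = subprocess.run(["pytest", str(test_file)], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # tests pass: a "clean-room" rewrite, at least with respect to this suite
        feedback = result.stdout + result.stderr
    return False
```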
That’s sort of interesting. I wonder whether code -> tests -> code really is enough to get a clean-room implementation, and whether this sort of tool would be a way to test that.
I don't think it is, but I'm really interested to see someone try it (I'm also lazy).
(And a more philosophical question: if it's not enough, what does that mean for continuous deployment?)