Comment by btrettel
9 hours ago
Similar to bragging about LOC, I have noticed in my own field of computational fluid dynamics that some vibe coders brag about how large or rigorous their test suites are. The problem is that whenever I look more closely, the tests are unremarkable and less rigorous than my own manually created tests. There are often big gaps in vibe-coded tests. I don't care if you have 1 million tests. 1 million easy tests, or 1 million tests that don't cover the right parts of the code, aren't worth much.
Yes, I've found tests are the one thing I need to write myself. I then also need to keep 'git diff'ing the tests, to make sure Claude doesn't decide to 'fix' the tests when its code doesn't work.
When I am rigorous about the tests, Claude has done an amazing job implementing some tricky algorithms from difficult academic papers, saving me time overall, but it does require more babysitting than I would like.
Give Claude a separate user and make the tests not writable by it. Generally, you should limit Claude to write access for only the specific things it needs to edit; this will also save you tokens, because it fails faster when it goes off the rails.
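As a minimal sketch of the permissions side of this (the paths and the src/tests split are made up, and this assumes the agent runs as a different user than you): stripping group/other write access is enough, since you remain the owner.

```shell
# Illustrative layout: you own the repo, the agent runs as another user.
mkdir -p /tmp/proj/src /tmp/proj/tests
echo 'def test_ok(): assert True' > /tmp/proj/tests/test_basic.py

# Owner (you) keeps read/write; everyone else, including the agent's
# user, gets read-only access to the tests.
chmod -R u=rwX,go=rX /tmp/proj/tests
ls -l /tmp/proj/tests
```

The agent can still read and run the tests, but any attempt to "fix" them fails immediately with a permission error, which is exactly the fast failure you want.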
You don't even need a separate user if you're on Linux (or WSL): just use the sandbox feature, which lets you specify allowed directories for read and/or write.
The sandbox is powered by bubblewrap (used by Flatpaks), so I trust it.
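For the curious, driving bubblewrap by hand looks roughly like this. This is a hypothetical sketch, not the actual sandbox internals, and the paths and wrapped command are placeholders: the whole filesystem is bound read-only, then only src/ is bound writable on top.

```shell
# Hypothetical wrapper (paths are placeholders): read-only view of
# everything, writable bind only for the directory the agent may edit.
sandbox() {
  bwrap \
    --ro-bind / / \
    --bind "$HOME/proj/src" "$HOME/proj/src" \
    --dev /dev \
    --proc /proc \
    --unshare-all \
    "$@"
}
# Usage sketch: sandbox <agent-command> ...
```

Inside the jail, writes anywhere outside src/ fail at the kernel level, so there is nothing to trust in the agent itself.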
“Red/green” TDD (i.e. actual TDD) and mutation testing (which LLMs can help with) are good ways to keep those tests under control.
Not gonna help with the test code quality, but at least the tests are going to be relevant.
The trick is crafting the minimal number of tests.
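A toy illustration of the mutation-testing idea mentioned above (tools like mutmut automate this; the function and suites here are invented): inject a small bug, a "mutant", and check whether the tests notice. A surviving mutant means the suite has a gap.

```python
# Hand-rolled mutation testing sketch. Real tools generate mutants
# automatically; here we write one by hand to show the principle.

def is_adult(age):
    return age >= 18          # original implementation

def is_adult_mutant(age):
    return age > 18           # mutant: >= changed to >

def weak_suite(f):
    # Misses the boundary case, so the mutant survives these tests.
    return f(30) is True and f(5) is False

def strong_suite(f):
    # Adds the boundary value, which kills the mutant.
    return weak_suite(f) and f(18) is True

# The weak suite cannot tell the original and the mutant apart:
assert weak_suite(is_adult) and weak_suite(is_adult_mutant)
# The strong suite passes the original and fails the mutant:
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)
```

The mutation score (fraction of mutants killed) is a much harder metric to game than raw test count, which is why it pairs well with LLM-generated tests.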
It is like reward hacking, where the reward function, in this case the test suite, is exploited to achieve the goal. The model wants to declare victory and be rewarded, so the tests it writes are not critical of the code under test. This is probably baked in during RL training; I am of course merely speculating.
It's a struggle to get LLMs to generate tests that aren't entirely stupid.
Like grepping the source code for a string, or assert(1 == 1, true).
You have to maintain a curated list of every kind of test not to write, or you get hundreds of pointless-at-best tests.
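The contrast, sketched in pytest style (the function and test names are invented): the first test is a tautology that can never fail, the second actually pins behavior.

```python
# Illustrative only: the kind of vacuous test to ban vs. one worth keeping.

def parse_version(s):
    """Toy function under test."""
    return tuple(int(p) for p in s.split("."))

def test_tautology():
    # Pointless: passes no matter what the code does.
    assert 1 == 1

def test_parse_version_pins_behavior():
    # Meaningful: fixes concrete outputs, including a non-trivial shape.
    assert parse_version("1.2.3") == (1, 2, 3)
    assert parse_version("10.0") == (10, 0)

test_tautology()
test_parse_version_pins_behavior()
```

Both tests pass, but only the second one can ever fail, which is the property a curated "do not write" list is trying to enforce.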
What I've observed in computational fluid dynamics is that LLMs seem to grab common validation cases used often in the literature, regardless of their relevance to the problem at hand. "Lid-driven cavity" cases were used by the two vibe-coded simulators I commented on at r/cfd, for instance. I never liked the lid-driven cavity problem because it rarely resembles an actual use case. A far better validation case would be an experiment on the same type of problem the user intends to solve. I think the lid-driven cavity problem is often picked in the literature because the geometry is easy to set up, not because it's relevant or particularly challenging. I don't know if this problem is due to vibe coders not actually having a particular use case in mind or LLMs overemphasizing what's common.
LLMs seem to also avoid checking the math of the simulator. In CFD, this is called verification. The comparisons are almost exclusively against experiments (validation), but it's possible for a model to be implemented incorrectly and for calibration of the model to hide that fact. It's common to check the order-of-accuracy of the numerical scheme to test whether it was implemented correctly, but I haven't seen any vibe coders do that. (LLMs definitely know about that procedure as I've asked multiple LLMs about it before. It's not an obscure procedure.)
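The order-of-accuracy check described above can be sketched in a few lines. This is a minimal illustration on a toy problem (a central-difference derivative rather than a full simulator): compute errors against an exact solution at two grid spacings and estimate the observed order p = log(e1/e2) / log(h1/h2), which should match the scheme's theoretical order.

```python
# Verification sketch: confirm a second-order central difference really
# converges at second order. A coding mistake in the scheme would show
# up as an observed order that disagrees with the theoretical one.
import math

def central_diff(f, x, h):
    # Second-order accurate approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

exact = math.cos(1.0)                      # d/dx sin(x) at x = 1
h1, h2 = 1e-2, 5e-3                        # two grid spacings, ratio 2
e1 = abs(central_diff(math.sin, 1.0, h1) - exact)
e2 = abs(central_diff(math.sin, 1.0, h2) - exact)

p = math.log(e1 / e2) / math.log(h1 / h2)  # observed order of accuracy
print(f"observed order of accuracy: {p:.3f}")
```

For a correctly implemented scheme, p lands close to 2 here; the same procedure applied to a full solver (refining the mesh, measuring error against a manufactured or exact solution) is the verification step the comment says vibe-coded simulators skip.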
Both of these points seem like they would be easy to turn into instructions that shape an LLM's testing strategy.
> have a curated list of every kind of test not to write
I've seen a lot of people interact with LLMs like this and I'm skeptical.
It's not how you'd "teach" a human (effectively). Teaching (humans) with positive examples is generally much more effective than with negative examples. You'd show them examples of good tests to write, discuss the properties you want, etc...
I try to interact with LLMs the same way. I certainly wouldn't say I've solved "how to interact with LLMs" but it seems to at least mostly work - though I haven't done any (pseudo-)scientific comparison testing or anything.
I'm curious if anyone else has opinions on what the best approach is here? Especially if backed up by actual data.
It's going to be difficult for anyone to have any more "data" than you already do. It's early days for all of us. It's not like there's anyone with 20 years of 2026 AI coding assistant experience.
However we can say based on the architecture of the LLMs and how they work that if you want them to not do something, you really don't want to mention the thing you don't want them to do at all. Eventually the negation gets smeared away and the thing you don't want them to do becomes something they consider. You want to stay as positive as possible and flood them with what you do want them to do, so they're too busy doing that to even consider what you didn't want them to do. You just plain don't want the thing you don't want in their vector space at all, not even with adjectives hanging on them.
I don't have much data to go on (in accordance with what 'jerf wrote), however I offer a high-level, abstract perspective.
The ideal set of outcomes exists as a tiny subspace of a high-dimensional space of possible solutions. Almost all of those solutions are bad. Giving negative examples removes some specific bits of the possibility space from consideration[0], which is not very useful, since almost everything that remains is bad too. Giving positive examples narrows the search down to where the good solutions are likely to be, which is drastically more effective.
A more humane intuition[1] is something I've observed as a parent and also through introspection. When I tell my kid to do something, and they don't understand WTF it is that I want, they'll do something weird and entirely undesirable. If I tell them, "don't do that - and also don't do [some other thing they haven't even thought of yet]", it's not going to improve the outcome; even repeated attempts at correction don't seem effective. In contrast, if I tell (or better, show) them what to do, they usually get the idea quickly, and whatever random experiments/play they invent is more likely to still be helpful.
--
[0] - While paradoxically also highlighting them - it's the "don't think of a pink elephant" phenomenon.
[1] - Yes, I love anthropomorphizing LLMs, because it works.
It's not a person. Unlike a person, it has a tremendous "memory" of everything its creators could get access to.
If I tell it what to do, I bias it towards doing those things and limit its ability to come up with things I didn't think of myself, and that ability is exactly what I want in testing. In separate passes, sure: a pass where I prescribe types and specific tests is effective. But I also want it to think of things I didn't, and a prompt like "write excellent tests that don't break these rules..." is how you get that.