Comment by colechristensen
10 hours ago
It's a struggle to get LLMs to generate tests that aren't entirely stupid.
Like grepping source code for a string, or assert(1==1, true).
You have to have a curated list of every kind of test not to write, or you get hundreds of pointless-at-best tests.
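To make the failure mode concrete, here is a hedged contrast between a vacuous test of the kind described and one that actually constrains behavior. The toy advance() function is invented for illustration, standing in for real simulation code:

```python
def advance(state):
    """Toy solver step standing in for real simulation code:
    moves mass between two cells without creating or destroying any."""
    a, b = state
    flux = 0.1 * (a - b)
    return (a - flux, b + flux)

def test_always_passes():
    # Vacuous: passes no matter what advance() does.
    assert 1 == 1

def test_mass_is_conserved():
    # Meaningful: pins down a property the code must preserve.
    state = (10.0, 0.0)
    for _ in range(100):
        state = advance(state)
    assert abs(sum(state) - 10.0) < 1e-9

test_always_passes()
test_mass_is_conserved()
```

Both tests pass, but only the second one would catch a bug in advance(); the first is noise that inflates the test count.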
What I've observed in computational fluid dynamics is that LLMs seem to grab common validation cases used often in the literature, regardless of the relevance to the problem at hand. "Lid-driven cavity" cases were used by the two vibe coded simulators I commented on at r/cfd, for instance. I never liked the lid-driven cavity problem because it rarely ever resembles an actual use case. A way better validation case would be an experiment on the same type of problem the user intends to solve. I think the lid-driven cavity problem is often picked in the literature because the geometry is easy to set up, not because it's relevant or particularly challenging. I don't know if this problem is due to vibe coders not actually having a particular use case in mind or LLMs overemphasizing what's common.
LLMs seem to also avoid checking the math of the simulator. In CFD, this is called verification. The comparisons are almost exclusively against experiments (validation), but it's possible for a model to be implemented incorrectly and for calibration of the model to hide that fact. It's common to check the order-of-accuracy of the numerical scheme to test whether it was implemented correctly, but I haven't seen any vibe coders do that. (LLMs definitely know about that procedure as I've asked multiple LLMs about it before. It's not an obscure procedure.)
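For readers unfamiliar with the procedure: an order-of-accuracy check runs the same scheme at two resolutions against a known exact solution and confirms the observed convergence rate matches the scheme's formal order. A minimal sketch using forward Euler (formally first order) on u' = -u, not tied to any particular simulator's API:

```python
import math

def solve(n_steps):
    """Integrate u' = -u, u(0) = 1, to t = 1 with forward Euler."""
    dt = 1.0 / n_steps
    u = 1.0
    for _ in range(n_steps):
        u += dt * (-u)
    return u

exact = math.exp(-1.0)

# Errors at two resolutions differing by a refinement ratio r = 2.
e_coarse = abs(solve(100) - exact)
e_fine = abs(solve(200) - exact)

# Observed order of accuracy (Richardson-style estimate).
# A correct implementation should approach the scheme's formal
# order (1 for forward Euler) as the timestep is refined.
p = math.log(e_coarse / e_fine) / math.log(2.0)
print(f"observed order: {p:.2f}")
```

A miscoded flux or boundary term typically drops the observed order below the formal one even when calibrated results still look plausible, which is exactly why verification catches bugs that validation against experiments can hide.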
Both of these points seem like things an LLM could easily be instructed on to shape its testing strategy.
I think so too. To be clear, I don't use LLMs for coding at the moment and was just commenting on what I've seen from others who do in computational fluid dynamics.
Edit: Let me add that while I think it would be easy to instruct an LLM to do what I'd like, LLMs don't do these things by default despite them being recognized as best practices, and I'm not confident in LLMs getting the data or references right for validation tests. My own experience is that LLMs are pretty bad when it comes to reproducing citations, and they tend to miss a lot of the literature.
> You have to have a curated list of every kind of test not to write
This should be distilled into a tool. Some kind of AST based code analyser/linter that fails if it sees stupid test structures.
Just having it in plain English in a HOW-TO-TEST.md file is hit and miss.
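A minimal sketch of what such a linter might look like, using Python's ast module. A real tool would need a much larger catalog of patterns; the two checks below only catch constant-only asserts:

```python
import ast

def find_vacuous_asserts(source: str) -> list[int]:
    """Flag asserts whose outcome cannot depend on the code under
    test: `assert True`-style constants and constant-vs-constant
    comparisons like `assert 1 == 1`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if isinstance(test, ast.Constant):
            flagged.append(node.lineno)
        elif isinstance(test, ast.Compare) and all(
            isinstance(operand, ast.Constant)
            for operand in [test.left, *test.comparators]
        ):
            flagged.append(node.lineno)
    return flagged

sample = """\
def test_trivial():
    assert 1 == 1

def test_real():
    assert compute_answer() == 42
"""
print(find_vacuous_asserts(sample))  # → [2]
```

Wired into CI as a pre-merge check, something like this fails fast instead of relying on the LLM to remember prose instructions from a markdown file.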
> have a curated list of every kind of test not to write
I've seen a lot of people interact with LLMs like this and I'm skeptical.
It's not how you'd "teach" a human (effectively). Teaching (humans) with positive examples is generally much more effective than with negative examples. You'd show them examples of good tests to write, discuss the properties you want, etc...
I try to interact with LLMs the same way. I certainly wouldn't say I've solved "how to interact with LLMs" but it seems to at least mostly work - though I haven't done any (pseudo-)scientific comparison testing or anything.
I'm curious if anyone else has opinions on what the best approach is here? Especially if backed up by actual data.
It's going to be difficult for anyone to have any more "data" than you already do. It's early days for all of us. It's not like there's anyone with 20 years of 2026 AI coding assistant experience.
However we can say based on the architecture of the LLMs and how they work that if you want them to not do something, you really don't want to mention the thing you don't want them to do at all. Eventually the negation gets smeared away and the thing you don't want them to do becomes something they consider. You want to stay as positive as possible and flood them with what you do want them to do, so they're too busy doing that to even consider what you didn't want them to do. You just plain don't want the thing you don't want in their vector space at all, not even with adjectives hanging on them.
I don't have much data to go on (in accordance with what 'jerf wrote), however I offer a high-level, abstract perspective.
The ideal set of outcomes exists as a tiny subspace of a high-dimensional space of possible solutions. Almost all those solutions are bad. Giving negative examples removes some specific bits of the possibility space from consideration[0] - not very useful, since almost everything else that remains is bad too. Giving positive examples narrows the search down to where the good solutions are likely to be - drastically more effective.
A more humane intuition[1]: something I've observed as a parent and also through introspection. When I tell my kid to do something and they don't understand WTF it is that I want, they'll do something weird and entirely undesirable. If I tell them, "don't do that - and also don't do [some other thing they haven't even thought of yet]", it's not going to improve the outcome; even repeated attempts at correction don't seem effective. In contrast, if I tell (or better, show) them what to do, they usually get the idea quickly, and whatever random experiments/play they invent is more likely to still be helpful.
--
[0] - While paradoxically also highlighting them - it's the "don't think of a pink elephant" phenomenon.
[1] - Yes, I love anthropomorphizing LLMs, because it works.
It's not a person. Unlike a person, it has a tremendous "memory" of everything ever done that its creators could get access to.
If I tell it what to do, I bias it towards doing those things and limit its ability to think of things I didn't think of myself, which is what I want in testing. In separate passes, sure: a pass where I prescribe types and specific tests is effective. But I also want it to think of things I didn't, and a prompt like "write excellent tests that don't break these rules..." is how you get that.