Comment by fastball
3 days ago
The idea as stated is a poor one, but with a slight reshuffling it seems promising:
You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code; it's required to run the program, after all. But you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase, and if the suite passes you carry on with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.
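A minimal sketch of that loop, assuming a saved prompt chain in a hypothetical prompts.json and a hypothetical generate() wrapper around whatever model/provider you actually call; the existing test suite is the only acceptance gate:

```python
import json
import subprocess

def generate(prompt: str, model: str) -> str:
    """Hypothetical wrapper around whatever provider/model you call."""
    raise NotImplementedError

def regenerate_project(model: str, goal: str, chain_path: str = "prompts.json") -> None:
    # Replay the saved prompt chain against the newer model, prepending the
    # new high-level goal (e.g. "focus on readability" or "use Rust").
    with open(chain_path) as f:
        chain = json.load(f)  # assumed shape: [{"prompt": ..., "target_file": ...}, ...]
    for step in chain:
        code = generate(f"{goal}\n\n{step['prompt']}", model=model)
        with open(step["target_file"], "w") as out:
            out.write(code)

def tests_pass() -> bool:
    # The existing test suite decides whether the rewrite is accepted.
    return subprocess.run(["pytest", "-q"]).returncode == 0

if __name__ == "__main__":
    regenerate_project(model="newer-model", goal="Create this project, focusing on readability.")
    print("keep rewrite" if tests_pass() else "reject rewrite, keep old codebase")
```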
Obviously it's not actually that simple, as tests don't catch everything (though with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
That means the temperature should be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now, with most models, if you give the same input prompt twice you will get two different solutions.
Even at temp 0, you might get different answers depending on your inference engine. There can be hardware differences as well as software issues (e.g. vLLM documents that, if you're using batching, you might get different answers depending on where in the batch sequence your query landed).
Production inference is not deterministic because of sharding (i.e. parameter weights split across several GPUs on the same machine, or MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.
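For the self-hosted route, a sketch of the usual PyTorch knobs (assuming a single GPU and no batched serving). This removes the timing-based kernel selection mentioned above, but not cross-hardware differences:

```python
import os
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG the stack touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Disable timing-based kernel autotuning (the cudnn.benchmark issue above).
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Some cuBLAS ops additionally need this workspace setting to be deterministic.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
```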
Claude Code already uses a temperature of 0 (just inspect the requests), but it's not deterministic.
Not to mention it also performs web searches, web fetching, etc., which would also make it non-deterministic.
Two years ago when I was working on this at a startup, setting OAI models’ temp to 0 still didn’t make them deterministic. Has that changed?
Do LLM inference engines have a way to seed their randomness, so as to have reproducible outputs while still allowing some variance if desired?
Yes, although it's not always exposed to the end user of LLM providers.
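For example, the OpenAI chat completions API exposes a seed parameter (documented as best-effort reproducibility rather than a hard guarantee); a minimal sketch with the OpenAI Python client, model name purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Same seed + same system_fingerprint should give mostly reproducible
# samples; it is explicitly not a hard determinism guarantee.
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Create this project, focusing on readability."}],
    temperature=0,
    seed=12345,
)
print(resp.system_fingerprint)
print(resp.choices[0].message.content)
```

Self-hosted engines expose similar knobs (e.g. llama.cpp's --seed flag).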
This is good: run it n times, then have the model review the outputs and pick the best one.
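A rough sketch of that best-of-n loop, reusing the same hypothetical generate() wrapper as in the earlier sketch; the review step is just another model call:

```python
def generate(prompt: str, model: str) -> str:
    """Same hypothetical model wrapper as in the earlier sketch."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 5, model: str = "some-model") -> str:
    # Sample n candidate solutions, then ask the model itself to pick one.
    candidates = [generate(prompt, model=model) for _ in range(n)]
    review = (
        "Here are several candidate solutions. Reply with only the index of the best one.\n\n"
        + "\n\n".join(f"[{i}]\n{c}" for i, c in enumerate(candidates))
    )
    return candidates[int(generate(review, model=model).strip())]
```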
I would only care about more deterministic output if I was repeating the same process with the same model, which is not the point of the exercise.
Your rephrasing better encompasses my idea, and I should have emphasized in the post that I do not think this is a good idea (nor possible) right now; it was more of a hand-wavy "how could we rethink source control in a post-LLM world" passing thought I had while reading through all the commits.
Clearly it struck a chord with a lot of the folks here though, and it's awesome to read the discourse.
One reason we treat tests that way is that we don't generally rewrite the application from scratch, but usually only refactor parts of the existing code or make smaller changes. If we regularly did the former, test suites would have to be much more comprehensive than they typically are. Not to mention that the tests need to change when the API changes, so you generally have to rewrite the unit tests along with the application and can't apply them unchanged.