Comment by SrslyJosh

3 days ago

> Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.

> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.

How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?

My work has involved a project that has been almost entirely generated code for over a decade. Not AI-generated; the actual work of the project is in creating the code generator.

One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.

Another thing we learned very quickly was that attempting to generate code, then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.

  • I think the biggest difference here is that your code generator is probably deterministic and you likely are able to debug the results it produces rather than treating it like a black box.

    • Overloading of the term "generate" is probably creating some confused ideas here. An LLM/agent is a lot more similar to a human in terms of its transformation of input into output than it is to a compiler or code generator.

      I've been working on a recent project with heavy use of AI (probably around 100 hours of long-running autonomous AI sprints over the last few weeks), and if you tried to re-run all of my prompts in order, even using the exact same models with the exact same tooling, it would almost certainly fall apart pretty quickly. After the first few, a huge portion of the remaining prompts would be referencing code that wouldn't exist and/or responding to things that wouldn't have been said in the AI's responses. Meta-prompting (prompting agents to prepare prompts for other agents) would be an interesting challenge to properly encode. And how would human code changes be represented, as patches against code that also wouldn't exist?

      The whole idea also ignores that AI being fast and cheap compared to human developers doesn't make it infinitely fast or free, or put it in the same league of quickness and cheapness as a compiler. Even if this were conceptually feasible, all it would really accomplish is making it so that any new release of a major software project takes weeks (or more) of build time and thousands of dollars (or more) burned on compute.

      It's an interesting thought experiment, but the way I would put it into practice would be to use tooling that includes all relevant prompts / chat logs in each commit message. Then maybe in the future an agent with a more advanced model could go through each commit in the history one by one, take notes on how each change could have been better implemented based on the associated commit message and any source prompts contained therein, use those notes to inform a consolidated set of recommended changes to the current code, and then actually apply the recommendations in a series of pull requests.
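
      As a rough sketch of the commit-side tooling (assuming a git prepare-commit-msg hook and an agent that writes its transcript to a known file; the .ai/session.log path here is made up, not any real tool's convention):

          #!/usr/bin/env python3
          # prepare-commit-msg hook: append the agent session transcript (if any)
          # to the commit message so the prompts travel with the commit.
          import sys, pathlib

          msg_file = pathlib.Path(sys.argv[1])        # git passes the commit-msg file path
          log_file = pathlib.Path(".ai/session.log")  # hypothetical transcript location

          if log_file.exists():
              msg_file.write_text(
                  msg_file.read_text()
                  + "\n\n--- AI session transcript ---\n"
                  + log_file.read_text()
              )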

    • People keep saying this and it doesn't make sense. I review code. I don't construct a theory of mind of the author of the code. With AI-generated code, if it isn't eminently reviewable, I reflexively kill the PR and either try again or change the tasking.

      There's always this vibe that, like, AI code is like an IOCCC puzzle. No. It's extremely boring mid-code. Any competent developer can review it.

      8 replies →

  • > One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable

    My rule of thumb is to have both in the same repo, but treat generated code like binary data. This was informed by a time I was burned by a tooling regression that broke the generated code, and the investigation was complicated by having to correlate commits across different repositories.

    • I love having generated code in the same repo as the generator, because with every commit I can regenerate the code and compare it to make sure it stays in sync. Then it forms something similar to golden tests, where if something unexpected changes it gets noticed in review.
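
      Something like this as a CI step is what I mean, assuming a ./generate.sh entry point and a generated/ output directory (both are placeholders for whatever the project actually uses):

          #!/usr/bin/env python3
          # Golden-test style check: re-run the generator and fail if the
          # committed output has drifted from what the generator now produces.
          import subprocess, sys

          subprocess.run(["./generate.sh", "--out", "generated.fresh"], check=True)
          diff = subprocess.run(["diff", "-ru", "generated", "generated.fresh"],
                                capture_output=True, text=True)
          if diff.returncode != 0:
              print("Committed generated code is out of sync with the generator:")
              print(diff.stdout)
              sys.exit(1)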

  • > One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable.

    Keeping the prompts or other commands in a separate repository is fine, but not committing the generated code at all I find questionable at best.

    • If you can 100% reproduce the same generated code from the same prompts, even 5 years later, given the same versions and everything, then I'd say "Sure, go ahead and don't save the generated code, we can always regenerate it." As someone who spent some time in frontend development: we've been doing it like that for a long time with (MB+ of) generated code, because keeping it in scm just isn't feasible long-term.

      But given this is about LLMs, which people tend to run with temperature>0, this is unlikely to be true, so then I'd really urge anyone to actually store the results (somewhere, maybe not in scm specifically) as otherwise you won't have any idea about what the code was in the future.

      5 replies →

    • I didn't read it as that - if I understood correctly, generated code must be quarantined very tightly. And inevitably you need to edit/override generated code, and the manner by which you alter it must go through some kind of process so the alteration is auditable and can again be clearly distinguished from generated code.

      Tbh this all sounds very familiar, like classic data management/admin systems for regular businesses. The only difference is that the data is code and the admins are the engineers themselves, so the temptation to "just" change things in place is too great. But I suspect it doesn't scale and is hard to manage, etc.

  • There’s a huge difference between deterministic generated code and LLM generated code. The latter will be different every time, sometimes significantly so. Subsequent prompts would almost immediately be useless. “You did X, but we want Y” would just blow up if the next time through the LLM (or the new model you’re trying) doesn’t produce X at all.

  • I will guess that you are generating orders of magnitude more lines of code with your software than people do when building projects with LLMs - if this is true I don't think the analogy holds.

  • > The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

    This sounds like putting assembly in C code. What was the input language? These two bits ("Not AI generated", "a feature flag") suggest that the code generator didn't have a natural language frontend, but rather a real programming language frontend.

    Did you or anyone else inform management that a code generator is essentially a compiler with extra characters? [0] If yes, then what was their response?

    I am concerned that your current/past work might have been to build a Compiler-as-a-Service (CaaS). [1] No shade, I'm just concerned that other managers might read all this and then try to build their own CaaS.

    [0] Yes, I'm implying that LLMs are compilers. Altman has played us for fools; he's taught a billion people the worst part of programming: fighting the compiler to give you the output you want.

    [1] Compiler-as-a-Service is the future our forefathers couldn't imagine warning us about. LLMs are CaaS's; time is a flat circle; where's the exit?; I want off this ride.

    • The input was a highly structured PDF specification of a family of protocols and formats. Essentially, a real language with very stupid parsing requirements and the occasional typo. The PDF itself was clearly intended for human consumption, but I'm 99% sure that someone somewhere at some point had a machine-readable specification that was used to generate most of the PDF. Sadly, no one seems to know where to even start looking for such a thing.

      > Did you or anyone else inform management that a code generator is essentially a compiler with extra characters?

      The output of the code generator was itself fed into a compiler that we also built; and about half of the codegen team (myself included) were themselves developers for the compiler.

      I think management is still scarred by the 20-year-old M4 monstrosity we are still maintaining because writing a compiler would be "too complex".

  • Please tell us the company you are working for so that we don't send our resumes there.

    Jokes aside, I have worked on projects where auto-generating code was the chosen solution, and it has always been 100% auto-generated, essentially at compilation time. Any hand-coded stuff needed to handle corner cases or glue pieces together was kept outside of the code generator.

I'm the first to admit that I'm an AI skeptic, but this goes way beyond my views about AI and is a fundamentally unsound idea.

Let's assume that a hypothetical future AI is perfect. It will produce correct output 100% of the time, with no bugs, errors, omissions, security flaws, or other failings. It will also generate output instantly and cost nothing to run.

Even with such perfection this idea is doomed to failure because it can only write code based on information in the prompt, which is written by a human. Any ambiguity, unstated assumption, or omission would result in a program that didn't work quite right. Even a perfect AI is not telepathic. So you'd need to explain and describe your intended solution extremely precisely without ambiguity. Especially considering in this "offline generation" case there is no opportunity for our presumed perfect AI to ask clarifying questions.

But, by definition, any language which is precise and clear enough to not produce ambiguity is effectively a programming language, so you've not gained anything over just writing code.

  • This is so eloquently put, and it really captures the absurdity of the notion that code itself will become redundant for building a software system.

  • We already have AI agents that can ask a human for help / clarification in those cases.

    It could also analyze the company website, marketing materials, and so forth, and use that to infer the missing pieces. (Again, something that exists today)

    • If the AI has to ask for clarification, you can’t run it as a reproducible build step as envisaged. It’s as if your compiler would pause to ask clarifying questions on each CI run.

      If the company website, marketing materials, and so forth become part of the input, you’ll have to put those in version control as well, as any change is likely to result in a different application being generated (which may or may not be what you want).

The idea as stated is a poor one, but a slight reshuffling and it seems promising:

You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.
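
A rough sketch of that workflow, with the test suite as the acceptance gate; the my-agent CLI and its flags are purely illustrative stand-ins for whatever agent tooling you actually use:

    # Replay a saved prompt chain against a newer model and keep the result
    # only if the existing test suite passes.
    import subprocess

    def regenerate_project(prompt_chain_path, model, workdir):
        # Placeholder: invoke whatever agent tooling you use; the command
        # and flags below are illustrative, not a real interface.
        subprocess.run(["my-agent", "replay", prompt_chain_path,
                        "--model", model, "--workdir", workdir], check=True)

    def tests_pass(workdir):
        # The committed test suite is the only acceptance criterion here.
        return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0

    def try_upgrade(prompt_chain_path, new_model):
        candidate = "rebuild-candidate"
        regenerate_project(prompt_chain_path, new_model, candidate)
        if tests_pass(candidate):
            print("Regenerated codebase passes the suite; review and merge.")
        else:
            print("Regeneration failed the tests; keep the existing code.")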

Obviously it's not actually that simple, as tests don't catch everything (tho with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
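
For what it's worth, property-based tests are the kind that survive a full regeneration, since they pin behavior rather than implementation. A tiny sketch using the Hypothesis library (slugify and myproject are hypothetical names, not real code under discussion):

    from hypothesis import given, strategies as st
    from myproject import slugify  # hypothetical function under test

    @given(st.text())
    def test_slugify_is_idempotent(s):
        # Running slugify twice should change nothing.
        assert slugify(slugify(s)) == slugify(s)

    @given(st.text())
    def test_slugify_output_is_url_safe(s):
        # Output should contain only alphanumerics and hyphens.
        assert all(c.isalnum() or c == "-" for c in slugify(s))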

  • That means the temperature should be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now, with most models, if you give the same input prompt twice you will get two different solutions.

    • Even at temp 0, you might get different answers, depending on your inference engine. There might be hardware differences, as well as software issues (e.g. vLLM documents this, if you're using batching, you might get different answers depending on where in the batch sequence your query landed).

    • Production inference is not deterministic because of sharding (i.e. parameter weights on several GPUs on the same machine or MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.

    • Claude Code already uses a temperature of 0 (just inspect the requests), but it's not deterministic.

      Not to mention it also performs web searches, web fetching, etc., which would also make it non-deterministic.

    • Two years ago when I was working on this at a startup, setting OAI models’ temp to 0 still didn’t make them deterministic. Has that changed?

    • Do LLM inference engines have a way to seed their randomness, so as to have reproducible outputs but still some variance if desired? (See the sketch below for what I mean.)

      1 reply →
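
      To illustrate, some APIs do expose this: the OpenAI chat completions client, for instance, accepts a seed parameter alongside temperature, though it is documented as best-effort rather than a hard guarantee (the model name below is just an example):

          from openai import OpenAI

          client = OpenAI()
          resp = client.chat.completions.create(
              model="gpt-4o-mini",                 # example model name
              messages=[{"role": "user",
                         "content": "Write a sort function in Python."}],
              temperature=0,
              seed=42,                             # best-effort reproducibility only
          )
          print(resp.choices[0].message.content)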

    • This is good: run it n times, have the model review them and pick the best one.

    • I would only care about more deterministic output if I was repeating the same process with the same model, which is not the point of the exercise.

  • Your rephrasing better encompasses my idea, and I should have emphasized in the post that I do not think this is a good idea (nor possible) right now; it was more of a hand-wavy "how could we rethink source control in a post-LLM world" passing thought I had while reading through all the commits.

    Clearly it struck a chord with a lot of the folks here though, and it's awesome to read the discourse.

  • One reason we treat tests that way is that we don’t generally rewrite the application from scratch, but usually only refactor parts of the existing code or make smaller changes. If we regularly did the former, test suites would have to be much more comprehensive than they typically are. Not to mention that the tests need to change when the API changes, so you generally have to rewrite the unit tests along with the application and can’t apply them unchanged.

>> what if we treated prompts as the actual source code?

You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.

  • Exactly!

    > this assumes models can achieve strict prompt adherence

    What does strict adherence to an ambiguous prompt even mean? It’s like those people asking Babbage if his machine would give the right answer when given the wrong figures. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a proposition.

  • Prompts are like stories on the board, and just as with engineers, the generated source code can vary depending on the model's understanding. Saying the prompts could be the actual code is such a wrong and dangerous thought.

Worse: models aren't deterministic! They use a temperature value to control randomness, precisely so they can escape local minima!

Regenerated code might behave differently, have different bugs (worst case), or not work at all (best case).

  • Nitpick - it's the ML system that is sampling from model predictions that has a temperature parameter, not the model itself. Temperature and even model aside, there are other sources of randomness like the underlying hardware that can cause the havoc you describe.
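
    To make the distinction concrete, here's a toy sketch of where temperature actually lives: it rescales the model's output scores at sampling time (the logits below are made-up numbers, not real model output):

        import numpy as np

        def sample(logits, temperature=1.0, rng=None):
            rng = rng or np.random.default_rng(0)     # seedable randomness
            logits = np.asarray(logits, dtype=float)
            if temperature == 0:
                return int(np.argmax(logits))         # greedy: no randomness at all
            scaled = logits / temperature             # higher temperature flattens the distribution
            probs = np.exp(scaled - scaled.max())
            probs /= probs.sum()
            return int(rng.choice(len(logits), p=probs))

        fake_logits = [2.0, 1.5, 0.3]                 # pretend next-token scores
        print(sample(fake_logits, temperature=0))     # deterministic
        print(sample(fake_logits, temperature=1.0))   # random, but reproducible with a fixed rng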

Plus, commits depend on the current state of the system.

What sense does “getting rid of vulnerabilities by phasing out {dependency}” make, if the next generation of the code might not rely on the mentioned library at all? What does “improve performance of {method}” mean if the next generation used a fully different implementation?

It makes no sense whatsoever except for a vibecoder’s script that’s being extrapolated into a codebase.

I'd say commit a comprehensive testing system with the prompts.

Prompts are in a sense what higher-level programming languages were to assembly. Sure, there is a crucial difference, which is reproducibility. I could try to write down my thoughts on why I think in the long run it won't be so problematic. I could be wrong, of course.

I run https://pollinations.ai, which serves over 4 million monthly active users quite reliably. It is mostly coded with AI; for about a year there has been no significant human commit. You can check the codebase. It's messy, but no messier than my codebases were pre-LLMs.

I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.

  • Agreed with the medium-term solution. I wish I put some more detail into that part of the post, I have more thoughts on it but didn't want to stray too far off topic.

Apart from obvious non-reproducibility, the other problem is lack of navigable structure. I can't command+click or "show usages" or "show definition" any more.

I'm pretty sure most people aren't doing "software engineering" when they program. There's a whole world of WordPress- and Dreamweaver-like programming out there too, where the consequences of messing up aren't really important.

LLMs can be configured to have deterministic output too.

Also, while it is in principle possible to have a deterministic LLM, the ones used by coding assistants aren't deterministic, so the prompts would not reliably reproduce the same software.

There is definitely an argument for also committing prompts, but it makes no sense to only commit prompts.

Forget different model versions. The exact same model with the exact same prompt will generate vastly different code each subsequent time you invoke it.

I think the author is saying you commit the prompt with the resulting code. You said it yourself, storage is free, so commit the prompt as a comment along with the output (not comment it out, if I’m not being clear); it would show the developer’s(?) intent and, to some degree, almost always contribute to the documentation process.

  • Author here :). Right now, I think the pragmatic thing to do is to include all prompts used in either the PR description and/or in the commit description. This wouldn't make my longshot idea of "regenerating a repo from the ground up" possible, but it still adds very helpful context to code reviewers and can help others on your team learn prompting techniques.

Some code is generated on the fly, like LLM UI/UX that writes Python code to do math.

Idk kinda different tho.

The idea is good, but we should commit both documentation and tests. They allow regenerating the code at will.

You couldn’t even tell in advance if the prompt produces code at all.

Yes, it's too early to be doing that now, but if you see the move to AI-assisted code as at least the same magnitude of change as the move from assembly to high level languages, the argument makes more sense.

Nobody commits the compiled code; this is the direction we are moving in: high-level source code is the new assembly.