Comment by tpmoney

15 hours ago

You have the source code though. That is the "reproducibility" bit you need. What extra reproducibility does having the prompts give you? Especially given that AI agents are non-deterministic in the first place. To me the idea that the prompts and sessions should be part of the commit history is akin to saying that the keystroke logs and commands issued to the IDE should be part of the commit history. Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is. You have the code that made build X and you have the code that made build X+1. As long as you can reliably recreate X and X+1 from what you have in the code, you have reproducibility.

> You have the source code though. That is the "reproducibility" bit you need.

I am talking about reproducing the (perhaps erroneous) logic, thinking, or motivations in cases of bugs, not reproducing outputs perfectly. As you said, current LLM models are non-deterministic, so we can't have perfect reproducibility based on the prompts. But when trying to fix a bug, having the basic prompts lets us see whether we run into similar issues given a bad prompt. That tells us whether the bad / bugged code was just a random spasm, or something reflecting bad or missing logic in the prompt.

> Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is.

I am really using "reproducibility" more abstractly here, and don't mean perfect reproducibility of the same code. E.g., consider this situation: "A developer said AI wrote this code according to these specs and this prompt, which, according to all reviewers, shouldn't produce the errors and bad code we are seeing. Let's see if we can indeed reproduce similar code given their specs and prompt." The less evidence we have of the specifics of a session, the less reproducible their generated code is, in this sense.
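
One lightweight way to preserve that kind of session evidence is to record it in commit message trailers, with the full prompt text archived in the repo. This is purely a sketch of a hypothetical team convention, not an established standard; the trailer names and file path below are made up for illustration:

```
Fix off-by-one in results pagination

Regenerated the loop from the amended spec after reviewers flagged
the dropped final page.

AI-Tool: (agent name and version, if known)
AI-Model: (model identifier and snapshot, if the tool exposes one)
Prompt-File: docs/prompts/pagination-fix.md  (hypothetical path)
```

Trailers like these survive in `git log` without bloating the diff, and give a later debugger at least the spec and prompt that supposedly produced the change.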

  • It's not reproducible though.

    Even with the exact same prompt and model, you can get dramatically different results, especially after a few iterations of the agent loop. Generally you can't even guarantee the prompt and model are the same: most tools don't let you pin the model snapshot and don't let you change the system prompt. You would have to make sure you have the exact same user config too. And once the model runs code, you aren't going to get the same outputs in most cases (there will be dates, logging timestamps, different host names and user names, etc.)

    I generally avoid even reading the LLM's own text (and I wish it produced less of it really) because it will often explain away bugs convincingly and I don't want my review to be biased. (This isn't LLM specific though -- humans also do this and I try to review code without talking to the author whenever possible.)
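
The environment-dependence point above can be sketched in a few lines: even byte-identical code produces different "session" output on every run, because the output stamps in the clock and the host name. A minimal illustration, not tied to any particular AI tool:

```python
# Minimal sketch: identical code, different output on every run,
# because the output embeds environment-dependent values.
import socket
import time

def build_log_line() -> str:
    # The nanosecond monotonic clock and the host name both vary
    # between runs and between machines.
    return f"built at t={time.perf_counter_ns()}ns on {socket.gethostname()}"

first = build_log_line()
second = build_log_line()

# Two back-to-back "runs" already disagree on the timestamp.
print(first != second)  # True
```

So diffing raw session transcripts against each other tells you little: the noise floor is nonzero even before the model's own sampling randomness enters.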

  • You are talking about documenting the intent of a piece of software if I understand correctly. But isn't that what READMEs and comments are for?

  • > I am talking about reproducing the (perhaps erroneous) logic or thinking or motivations in cases of bugs

    But "to what purpose" is where this all loses me. What do you gain from seeing what was said to the AI that generated the bug? To me it feels like these sorts of things will fall into 3 broad categories:

    1) Underspecified design requirements

    2) General design bugs arising from unconsidered edge cases

    3) AI gone off the rails failures

    For items in category 1, these are failures you already know how to diagnose with human developers: your design docs should already be recorded and preserved as part of your development lifecycle, and you should be feeding those same human-readable design documents to the AI. The session output here seems irrelevant to me, since you have the input and you have the output, and everything in between is not reproducible with an AI. At best, if you preserve the history you can possibly get a "why" answer out of it, in the same way that you might ask a dev "why did you interpret A to mean B", but you're preserving an awful lot of noise and useless data in the hopes that the AI dropped something in its output that shows you a place where your spec isn't specific or detailed enough, which a simple human review of the spec would catch anyway once the bug is known.

    For category 2, again this is no different from the human operator case, and there's no value that I can see in confirming in the logs that the AI definitely didn't consider this edge case (or even did consider it and rejected it for some erroneous reason). AI models, in the forms that folks are using them right now, are not (yet? ever?) capable of learning from a post-mortem discussion about something like that to improve their behavior going forward. And it's not even clear to me that, even if they were, you would need the output of the session as opposed to just telling the robot "hey, at line 354 in foo.bar you assumed that A would never be possible, but no place in the code before that point asserts it, so in the future you should always check for the possibility of A because our system can't guarantee it will never occur."

    And as for category 3, since it's going off the rails, the only real thing to learn is whether you need a new model entirely or if it was a random fluke, but since you have the inputs used and you know they're "correct", I don't see what the session gives you here either. To validate whether you need a new model, it seems that just feeding your input again and seeing if you get a similar "off the rails" result is sufficient. And if you don't get another "off the rails" result, I sincerely doubt your model is going to be capable of adequately diagnosing its own internal state to sort out why you got that result 3 months ago.

The source code is whatever is easiest for a human to understand. Committing AI-generated code without the prompts is like committing compiler-generated machine code.