Comment by rfw300
16 hours ago
Why should it be? The agent session is a messy intermediate output, not an artifact that should be part of the final product. If the "why" of a code change is important, have your agent write a commit message or a documentation file that is polished and intended for consumption.
This reduces down to the problem of summarization - a quite difficult one. At commit time it’s difficult to know what questions readers will have. You can get close but never all the way there.
Pre-AI, when engineers couldn’t find the answer in commit messages or documentation, they would ask the author “why” and that human would “compute” the summary on demand.
I think that’s what I expect to do with these agent sessions - I don’t want more markdown, I want to ask it questions on demand. Git AI (https://github.com/git-ai-project/git-ai) uses the prompts that way. I think that model will win out. Save sessions. Read/ask questions relevant to the current agent’s work.
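To make "save sessions" concrete, here is a minimal sketch that attaches a raw transcript to a commit using `git notes`, which keeps it out of both the tree and the commit message. This is not a description of how Git AI stores prompts; the notes ref `agent-sessions` and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical notes ref, chosen for this example only.
NOTES_REF = "agent-sessions"


def notes_add_command(transcript: Path, commit: str = "HEAD") -> list[str]:
    """Build the git command that attaches a transcript file to a commit as a note."""
    # -f overwrites an existing note on that commit; -F reads the note body from a file.
    return ["git", "notes", f"--ref={NOTES_REF}", "add", "-f", "-F", str(transcript), commit]


def notes_show_command(commit: str = "HEAD") -> list[str]:
    """Build the git command that prints the saved transcript for a commit."""
    return ["git", "notes", f"--ref={NOTES_REF}", "show", commit]


def save_session(transcript: Path, commit: str = "HEAD") -> None:
    """Attach the transcript to the commit; must be run inside a git repository."""
    subprocess.run(notes_add_command(transcript, commit), check=True)
```

Notes are not pushed by default, so a team sharing them would also need to push and fetch `refs/notes/agent-sessions` explicitly.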
On asking peers: this is regrettably on the way out today. I’ll ask engineers about complex code they generated and they can’t give good answers. I think it’s because it all happened so fast: they didn’t sit with the problem for 48 hours. So even if they steered the agent thoughtfully, it’s hard to remember all the decisions they made a week later.
It should be a distillation of the session and/or the prompts, at bare minimum. No, it should not include e.g. research-type questions, but it should include prompts that the user wrote after reading the answers to those research-type questions, and perhaps some distillation of the links / references surfaced during the research.
Prompts probably should be distilled / summarized, especially if they are research-based prompts, but code-gen prompts should probably be saved verbatim.
Reproducibility is a thing, and though perfect reproducibility isn't desirable, something needs to make up for the fact that vibe-coding is highly inscrutable and hard to review. Making the summary of the session too vague / distilled makes it hard to iterate and improve when / if some bad prompts / assumptions are not documented in any way.
You have the source code though. That is the "reproducibility" bit you need. What extra reproducibility does having the prompts give you? Especially given that AI agents are non-deterministic in the first place. To me the idea that the prompts and sessions should be part of the commit history is akin to saying that the keystroke logs and commands issued to the IDE should be part of the commit history. Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is. You have the code that made build X and you have the code that made build X+1. As long as you can reliably recreate X and X+1 from what you have in the code, you have reproducibility.
> You have the source code though. That is the "reproducibility" bit you need.
I am talking about reproducing the (perhaps erroneous) logic or thinking or motivations in cases of bugs, not reproducing outputs perfectly. As you said, current LLM models are non-deterministic, so we can't have perfect reproducibility based on the prompts, but, when trying to fix a bug, having the basic prompts we can see if we run into similar issues given a bad prompt. This gives us information about whether the bad / bugged code was just a random spasm, or something reflecting bad / missing logic in the prompt.
> Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is.
I am really using "reproducibility" more abstractly here, and don't mean perfect reproducibility of the same code. I.e. consider this situation: "A developer said AI wrote this code according to these specs and prompt, which, according to all reviewers, shouldn't produce the errors and bad code we are seeing. Let's see if we can indeed reproduce similar code given their specs and prompt". The less evidence we have of the specifics of a session, the less reproducible their generated code is, in this sense.
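A rough sketch of that check, assuming you have the saved prompt plus some way to call the model and some way to detect the bug. Both are passed in as plain functions here, and every name is hypothetical:

```python
from typing import Callable


def failure_rate(
    generate: Callable[[str], str],
    prompt: str,
    is_buggy: Callable[[str], bool],
    runs: int = 10,
) -> float:
    """Re-run one saved prompt `runs` times and return the fraction of buggy outputs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    # Each call to generate() is an independent sample from a non-deterministic model.
    failures = sum(1 for _ in range(runs) if is_buggy(generate(prompt)))
    return failures / runs
```

A rate near zero suggests the original bug was a one-off; a high rate points at something wrong or missing in the prompt or specs. Without the saved prompt there is nothing to re-run.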
The source code is whatever is easiest for a human to understand. Committing AI-generated code without the prompts is like committing compiler-generated machine code.
> It should be a distillation of the session and/or the prompts, at bare minimum.
Huh, I thought that's what a commit message is for.
I mean, sure, a good, detailed commit message is perfectly fine to me in place of the prompts / a session distillation. But I am not holding my breath for vibe-coders to properly review their code and write such a commit message. But if they do, great! No need for prompt / session details.
Completely agree. Until recently I only let LLMs write my commit messages, but I've found that versioning the plan files gives a better artifact: it preserves agentic decisions and my own reasoning without the noise.
My current workflow: write a detailed plan first, then run a standard implement -> review loop where the agent updates the plan as errors surface. The final plan doc becomes something genuinely useful for future iterations, not just a transcript of how we got there.
In my case I have set things up so that the agent is the repo. The repo texts compose the agent’s memory. Changes to the repo require the agent to approve.
Repos also message each other and coordinate plans and changes with each other and make feature requests which the repo agent then manages.
So I keep the agents’ semantically compressed memories as part of the repo, as well as the original transcripts, because often they lose coherence, and reviewing every user-submitted prompt realigns the specs, stories, and requirements.
post mortems / bug hunting -- pinpointing what part of the logic was to blame for a certain problem.
this is what granular commits are for, the kilobytes long log of claude running in circles over bullshit isn't going to help anyone
I think the parent comment is asking “why did the agent produce this bug, and why wasn’t it caught”, which is a separate problem from the one granular commits solve, namely finding the bug in the first place.
Then look at the code, the session will only confuse. To read an LLM's explanation is to anthropomorphize what will just be a probabilistic incident.
but that takes more tokens and time. if you just save the raw log, you can always do that later if you want to consume it. plus, having the full log allows asking many different questions later.
How’s it any different than a diff log?
Better question: how is it in any way similar?
If you read the history of both, and assuming there are good comments and documentation, each shows you the reasoning that went into the decision-making.