Comment by 827a
14 hours ago
IMO: This might be a contrarian opinion, but I don't think so. It's much the same problem as asking, for example, whether every single line you write, or every function, becomes a commit. The answer to this granularity question is, much like anything, that you have to think of the audience: who is served by persisting these sessions? I would suspect there is little reason why future engineers, or future LLMs, would need access to them; they likely contain a significant amount of noise, incorrect implementations, and red herrings. The product of the session is what matters.
I do think there's more value in ensuring that the initial spec, or the "first prompt" (which IME is usually much bigger and tries to get 80% of the way there) is stored. And, maybe part of the product is an LLM summary of that spec, the changes we made to the spec within the session, and a summary of what is built. But... that could be the commit message? Or just in a markdown file. Or in Notion or whatever.
While it's noisy and complicated for humans to read through, this session info is primarily for future AI to read and use as additional input for their tasks.
We could have LLMs ingest all these historical sessions, and use them as context for the current session. Basically treat the current session as an extension of a much, much longer previous session.
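A minimal sketch of what that ingestion step could look like, assuming sessions were archived as JSONL files of {"role", "content"} messages. The directory layout, file naming, and character budget here are all made up for illustration, not any tool's real format:

```python
import json
from pathlib import Path

def build_context(session_dir: str, max_chars: int = 50_000) -> str:
    """Concatenate archived session transcripts into one context string,
    newest file first, truncated to a character budget. The per-line
    {"role": ..., "content": ...} JSONL shape is an assumption."""
    chunks: list[str] = []
    total = 0
    for path in sorted(Path(session_dir).glob("*.jsonl"), reverse=True):
        for line in path.read_text().splitlines():
            msg = json.loads(line)
            piece = f'{msg["role"]}: {msg["content"]}\n'
            if total + len(piece) > max_chars:
                return "".join(chunks)  # budget hit: stop early
            chunks.append(piece)
            total += len(piece)
    return "".join(chunks)
```

In practice you'd probably want smarter selection than newest-first truncation, but the point is that the raw material is already machine-readable.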
Plus, future models might be able to "understand" the limitations of current models, and use the historical session info to identify where the generated code could have deviated from user intention. That might be useful for generating code, or just for more efficient analysis by focusing on possible "hotspots", etc.
Basically, it's high time we start capturing any and all human input for future models, especially open source model development, because I'm sure the companies already have a bunch of this kind of data.
That's exactly one of the reasons I've been archiving the sessions using DataClaw. The sessions can contain more useful information than the comments for humans.
[0] https://github.com/peteromallet/dataclaw
But AI can just read the diff. The natural language isn't important.
Or just "write a good commit message based on our session, pls", then both humans and llms can use it.
TBH I don't think it's worth the context space to do this. I'm skeptical that this would have any meaningful benefits vs just investing in targeted docs, skills, etc.
I already keep a "benchmarks.md" file to track commits and benchmark results plus what did/didn't work. I think that's far more concise and helpful than the massive context that was used to get there. And it's useful for a human to read, which I think is good. I prefer things to remain maximally beneficial to both humans and AI - disconnects seem to be problematic.
> While it's noisy and complicated for humans to read through, this session info is primarily for future AI to read and use as additional input for their tasks.
Context rot is very much a thing, and may still be for future agents. Dumping tens or hundreds of thousands of trash tokens into context very much worsens the performance of the agent.
Similarly, git logs of existing human code seem to be a good source of info that llms don't look at unless explicitly prompted to do so.
Right now, it might not be worth the cost. That might change in future so that they consider it by default?
Future AIs can probably infer the requirements better than humans can write them.
It's just noise for AI too. There is no reason to be lazy with context management when you can simply ask the AI to write the summary of the session. But even that is hardly useful when AI can just read the source of truth which is the code and committed docs
> It's much the same problem as asking, for example, if every single line you write, or every function, becomes a commit.
Hmm, I think that's the wrong comparison? The more useful comparison might be: should all your notes you made and dead ends you tried become part of the commit?
When a human writes the code should all their slack messages about the project be committed into the repo?
Ideally, yes? Or a reference ticket number pointing to that discussion
The main limitation is the human effort to compile that information, but if the LLM already has the transcript ready, it's free.
Ideally, yes. Although Slack is a vendor lock-in and we need a better platform to archive the sessions.
That would be amazing! In the moment, it's a lot of noise, but say you're trying to figure out a bit of code that Greg wrote four years ago and oh btw he's no longer with the company. Having access to his emails and Slack would be amazing context to try to reverse engineer and figure out whytf he did what he did. Did he just pick a thing and run with it, so I can replace it and not worry about it, or was it a very intentional choice and do not replace, because everything else will break?
In some cases this is what I ask from my juniors. Not for every commit, but during some specific reviews. The goal is to coach them on why and how they got a specific result.
What is a junior? I don't see it in claude.
I think this too. I use the initial spec from the issue tracker as the prompt and work from there.
The missteps the agent takes and the nudging I do along the way are ephemeral, and new models and tooling will behave differently.
If you have the original prompt and the diff you have everything you need.
This is a central problem that we've already seen proliferate wildly in scientific research, and if the same is allowed to become embedded in foundational code, the future outlook would be grim.
Replication crisis[1].
Given initial conditions, and even accounting for 'noise', would an LLM arrive at the same output? It should, for the same reason math problems require one to show their working. Scientific papers require methods and pseudocode, while also requiring limitations to be stated.
Without similar guardrails, maintenance and extension of future code becomes a choose-your-own-adventure where you have to guess at the intent and conditions of the LLM used.
[1] https://www.ipr.northwestern.edu/news/2024/an-existential-cr...
Agentic engineering is fundamentally different, not just because of the inherent unpredictability of LLMs, but also because there's a wildly good chance that two years from now Opus 4.6 will no longer even be a model anyone can use to write code with.
You can leave commit messages or comments without spamming your history with every "now I'm inspecting this file..." or "oops, that actually works differently than I expected" transcript.
In fact, I'd wager that all that excess noise would make it harder to discern meaningful things in the future than simply distilling the meaningful parts of the session into comments and commit messages.
IMO, you should do both. The cost of intellectual effort is dropping to zero, and getting an AI to scan through a transcript for relevant details is not going to cost much at all.
Those messages are part of the linguistic context used to generate the code, though. Don’t confuse them for when humans (or human written programs) display progress messages.
If they aren’t important for your specific purposes, you can summarize them with an LLM.
Even if you pin the seed and spin up your own local LLM, changes to continuous batching at the vLLM level, or just a different CUDA driver version, will completely break your bitwise float convergence. Reproducibility in ML generation is a total myth; in prod we only work with the final output anyway.
> for the same reason math problems require one to show their working.
We don't put our intermediate proof attempts in papers, only the final best one we have. So that analogy doesn't work.
For every proof in a paper there is probably 100 non-working / ugly sketches or just snippets of proofs that exist somewhere in a notebook or erased on a blackboard.
But we've been doing the same without LLMs. What are the new pieces that LLMs would bring in?
With normal practice, say if I'm reading through the Linux source for a particular module, I'd be able to reference mailing lists and patchsets, which by convention have to be human parsable/reviewable, with the history/comments/git blame etc. putting in one's headspace the frame of reference that produced it.
There is some potential value for the audit trail if you work in a special place where you are sworn in and where transparency is important, but who is going to read all of that, and how do you even know the transcript corresponds to the code if the committer is up to something?
I agree that probably not everything should be stored - it's too noisy. But the reason the session is so interesting is precisely the latter part of the conversation - all the corrections in the details, where the actual, more precise requirements crystallize.
AKA the code. You're all talking about the code.
The prompt is the code :) The code is like a compiled binary. How long until we put the prompts in `src/` and the code in `bin/`, I wonder...
Not at all, unless it contains very thorough reasoning comments (which arguably it should). The code is only an artifact, a lot of which is incidental and flexible. The prompts contain the actual constraints.
People are trying to retain value as their value is being evaporated.
Then just summarize the final requirements
That’s what I do! I think it works well and helps future agents a lot in understanding why the codebase is the way it is. I do have to oversee the commit messages, but it does avoid a lot of noise and maybe it’s a normal part of HITL development.
If it's non-trivial work, have the Agent distill it down to an ADR.
> It's much the same problem as asking, for example, if every single line you write, or every function, becomes a commit.
As a huge fan of atomic commits I'd say that the smallest logical piece should be a commit. I've never seen "intention-in-a-commit", i.e. multiple changes with an overarching goal, influence reviews. There's usually some kind of ticket that can be linked to the code itself if needed.
For me, it’s about preserving optionality.
If I can run resume {session_id} within 30 days of a file’s latest change, there’s a strong chance I’ll continue evolving that story thread—or at least I’ve removed the friction if I choose to.
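A tiny sketch of that 30-day check, assuming the mapping from a file back to its session id happens elsewhere (the function name and threshold are illustrative, not part of any tool):

```python
import time
from pathlib import Path

def resumable(path: str, window_days: int = 30) -> bool:
    """True if the file's latest change falls inside the resume window.
    The 30-day window mirrors the retention period mentioned above."""
    age_seconds = time.time() - Path(path).stat().st_mtime
    return age_seconds <= window_days * 86_400
```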
It seems unlikely that a file that hasn't changed in 30 days in an environment with a lot of "agents" cranking away on things is going to be particularly meaningful to revisit with the context from 30 days ago, vs using new context with everything that's been changed and learned since then.
First N prompts is a good / practical heuristic for something worth storing (whether N = 1 or greater).
> Who is served by persisting these sessions? I would suspect that there is little reason why future engineers, or future LLMs, would need access to them
I disagree. When working on legacy code, one of my biggest issues is usually the question 'why is this the way it is?' Devs hate documentation, Jira often isn't updated with decisions made during programming, so sometimes you just have to guess why 'wait(500)' or 'n = n - 1' are there.
If it was written with AI and the conversation history is available, I can ask my AI: 'why is this code here?', which would often save me a ton of time and headache when touching that code in the future.
LLM session transcripts as part of the commit is a neat idea to consider, to be sure, but I know that I damn well don't want to read eight pages of "You're absolutely right! It's not a foo. It's a bar" slop (for each commit no less!) when I'm trying to find someone to git blame.
The solution is as it always has been: the commit message is where you convey to your fellow humans, succinctly and clearly, why you made the commit.
I like the idea of committing the initial transcript somewhere in the docs/ directory or something. I'll very likely start doing this in my side projects.
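For instance, a small helper could drop that first prompt into the repo before committing; the docs/prompts location and .md naming below are just one made-up convention, not a standard:

```python
from pathlib import Path

def archive_initial_prompt(transcript: str, repo_root: str, slug: str) -> Path:
    """Write the session's opening prompt into docs/prompts/<slug>.md so it
    travels with the repo and shows up in normal code review."""
    dest_dir = Path(repo_root) / "docs" / "prompts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{slug}.md"
    dest.write_text(transcript)
    return dest
```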
You ignore the reality of vibe coding. If someone just prompts, never reads the code, and barely tests the result, then the prompts can be a valuable insight.
But I am not rooting for either, just saying.
If A vibes, and B is overwhelmed with noise, how does B reliably go through it? If using AI, this necessarily faces the same problems that recording all A's actions was trying to solve in the first place, and we'd be stuck in a never-ending cycle.
We could also distribute the task to B, C, D, ... N actors, and assume that each of them would "cover" (i.e. understand) some part of A's output. But this suddenly becomes very labor intensive for other reasons, such as coordination and trust that all the reviewers cover adequately within the given time...
Or we could tell A that this is not a vibe playground and fire them.