Comment by simonw
7 days ago
I had a fascinating conversation about this the other day. An engineer was telling me about his LLM process, which is effectively this:
1. Collaborate on a detailed spec
2. Have it implement that spec
3. Spend a lot of time on review and QA - is the code good? Does the feature work well?
4. Take lessons from that process and write them down for the LLM to use next time - using CLAUDE.md or similar
That last step is the interesting one. You're right: humans improve, LLMs don't... but that means it's on us as their users to manage the improvement cycle by using every feature iteration as an opportunity to improve how they work.
I've heard similar things from a few people now: by constantly iterating on their CLAUDE.md - adding extra instructions every time the bot makes a mistake, telling it to do things like always write the tests first, run the linter, reuse the BaseView class when building a new application view, etc - they get wildly better results over time.
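For a concrete (entirely made-up) example, the kind of rules that accumulate in a CLAUDE.md over time tend to look something like this - BaseView here is just a stand-in for whatever base class a given project actually uses:

```markdown
# Project notes for Claude

- Write the tests first, then the implementation, then run the full test suite.
- Run the linter before reporting a task as done.
- New application views must subclass BaseView; don't build views from scratch.
- Never hand-edit generated migration files; regenerate them instead.
```

Each line is cheap to add, and usually only gets written down because the bot got that thing wrong once.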
I don't buy your last sentence at all.
AGENTS.md is just a place to put stuff you don't want to tell LLMs over and over again. It's not a set of magical instructions LLMs follow 100% of the time; those lines carry no additional importance over what you put into the prompt manually. Your carefully curated AGENTS.md is only really useful at the very beginning of the conversation, but the longer the conversation gets, the less important those tokens at the top are. Somewhere around 100k tokens AGENTS.md might as well not exist; I constantly have to "remind it" of the very first paragraph in there.
Go start a conversation and contradict what's written in AGENTS.md halfway through the problem. Which of the two contradicting statements will take precedence? The latter one! Therefore, all the time you've spent curating your AGENTS.md is time you've wasted thinking you're "teaching" the LLM anything.
Whether the tokens are created manually or programmatically isn't really relevant here. The order and amount of tokens is, in combination with the ingestion -> output logic the LLM API / inference engine operates on. Many current models definitely have a tendency to start veering off after 100k tokens, which makes context pruning important as well.
What if you just automatically append the .md file at the end of the context, instead of prepending at the start, and add a note that the instructions in the .md file should always be prioritized?
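A rough sketch of what that could look like if you're assembling the context yourself (the function and message shape here are made up purely for illustration, not any particular tool's API):

```python
from pathlib import Path

def build_messages(history: list[dict], rules_path: str = "AGENTS.md") -> list[dict]:
    """Append the rules file at the end of the context instead of
    prepending it, so it sits closest to the model's next output."""
    rules = Path(rules_path).read_text()
    reminder = {
        "role": "user",
        "content": (
            "Reminder: the instructions below take priority over anything "
            "earlier in this conversation.\n\n" + rules
        ),
    }
    # Conversation history first, rules last.
    return history + [reminder]
```

The trade-off is that you re-send (and pay for) those rule tokens on every turn.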
> Your carefully curated AGENTS.md is only really useful at the very beginning of the conversation, but the longer the conversation gets, the less important those tokens at the top are.
If that's genuinely causing you problems you can restart your session frequently to avoid the context rot.
Come on, let's not pretend 100k tokens is something I need to spend hours to reach before your helpful advice becomes even remotely valid; even the most basic problems struggle to fit into that.
For the fun of it I just started a new conversation with Sonnet 4, passed it one 550-line file (25 kilobytes) and my AGENTS.md (<200 lines, 8 kilobytes), and my only instruction was to "do nothing". It spat out exactly 100 words describing my file without modifying anything, and that's already almost a fifth of my context window gone (18k tokens, to be exact).
I then asked it to rewrite part of the file to "make it look better" (184 lines added, 112 lines deleted according to git) and I'm already at 33k before I've reviewed a single line. Heaven forbid I need to build on top of that change in a different file, because by then my AGENTS.md might as well not exist!
We really should be sharing wisdom about AGENTS.md files here.
I thought about making some kind of community project where people could contribute their lines to a common file, and even some kind of MCP server or RAG system that automatically selects relevant "rules" given a certain project context. Do you think there would be interest in something like that?
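To make the "automatically selects relevant rules" part less abstract: even something as naive as keyword matching against the files being touched would be a starting point. A purely hypothetical sketch (nothing here exists yet, and a real version would presumably use embeddings or an MCP tool call):

```python
import re
from pathlib import Path

def select_rules(rules_file: str, changed_files: list[str]) -> list[str]:
    """Keep only the bullet-point rules whose keywords appear in the
    paths of the files currently being worked on. Illustrative only."""
    selected = []
    for line in Path(rules_file).read_text().splitlines():
        if not line.startswith("- "):
            continue  # only consider bullet-point rules
        keywords = re.findall(r"[a-z_]{4,}", line.lower())
        if any(k in path.lower() for path in changed_files for k in keywords):
            selected.append(line)
    return selected

# select_rules("AGENTS.md", ["src/views/report_view.py"]) would keep only
# the rules whose wording overlaps with "views", "report", etc.
```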
The problem is that you get to 100k tokens. Don't do that; split tasks into smaller ones.
Totally agree on this. It has delivered substantial value for me in my projects. The models are always going to give back results optimized to use minimal computing resources in the provider's infrastructure. To overcome this, I've seen some people using or suggesting running the AI in self-correction loops, the upside being minimal human intervention.
However, personally I have gotten very good results by using the AI with continuous interaction and allowing implementation only after a good amount of time deliberating on design/architecture. I almost always append 'do not implement before we discuss and finalize the design' or 'clarify your assumptions, doubts or queries before implementation'.
When I asked Gemini to give a name to this style of interaction, it suggested 'Dialog Driven Development' and contrasted it with 'vibe coding'. Transcript summary and AI disclaimer written by Gemini below:
https://gingerhome.github.io/gingee-docs/docs/ai-disclaimer.... https://gingerhome.github.io/gingee-docs/docs/ai-transcript/...
I’m finding that whether this process works well is a measure (and a function) of how well-factored and disciplined a codebase is in the first place. Funnily enough, LLMs do seem to have a better time extending systems that are well-engineered for extensibility.
That’s the part which gives me optimism, and even more enjoyment of the craft — that quality pays back so immediately, makes it that much easier to justify the extra effort, and having these tools at our disposal reduces the ‘activation energy’ for necessary re-work that may before have just seemed too monumental.
If a codebase is in good shape for people to produce high-quality work, then so too can the machines. Clear, up-to-date, close-to-the-code, low-redundancy documentation; self-documenting code and tests that prioritize expression of intent over cleverness; consistent patterns of abstraction that don’t necessitate jarring context switches from one area to the next; etc.
All this stuff is so much easier to lay down with an agent loaded up on the relevant context too.
Edit: oh, I see you said as much in the article :)
> but that means it's on us as their users to manage the improvement cycle by using every feature iteration as an opportunity to improve how they work
This doesn't interest me at all honestly
And every change to the model might invalidate all of this work?
No thank you