Comment by deaux
5 days ago
I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.
> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing in general, seem misplaced.
For the developer-made ones: a 4% improvement is massive! A 4% gain from a simple markdown file makes it a must-have.
> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really read "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for the prompts; who knows whether files generated with a better prompt would show an improvement.
The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency. Exactly the kind of thing that is very common in closed-source codebases, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent small vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have a very mixed quality of AGENTS.md files in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.
Hey thanks for your review, a paper author here.
Regarding the 4% improvement for human-written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.
The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
But ultimately I agree with your post. In fact we do recommend writing a good AGENTS.md, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and in the conclusion.
Without measuring quality of output, this seems irrelevant to me.
My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.
Performance is not a consideration.
If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents'.
CLAUDE.md isn't a silver bullet either; I've had it lose context a couple of questions deep. I do like GSD[1] though, it's been a great addition to the stack. I also use multiple, different LLMs as a judge for PRs, which captures a load of issues too.
[1] https://github.com/gsd-build/get-shit-done
In this context, "performance" means "does it do what we want it to do" not "does it do it quickly". Quality of output is what they're measuring, speed is not a consideration.
You're measuring binary outcomes, so you can use a beta distribution to understand the distribution of possible success rates given your observations, and thereby provide a confidence interval on the observed success rates. This would help us see whether that 4% improvement is statistically significant, or if it is likely to be noise.
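A minimal sketch of that calculation in Python, using scipy and completely invented counts (not the paper's actual numbers), just to show the mechanics:

```python
# Invented counts for illustration only: suppose a benchmark of 138 tasks,
# with 60 solved without an AGENTS.md and 66 solved with one (~4 points apart).
from scipy.stats import beta

def credible_interval(successes, trials, level=0.95):
    """Equal-tailed credible interval for a success rate, using a
    uniform Beta(1, 1) prior updated with the observed outcomes."""
    a = 1 + successes
    b = 1 + trials - successes
    lower = beta.ppf((1 - level) / 2, a, b)
    upper = beta.ppf(1 - (1 - level) / 2, a, b)
    return lower, upper

print(credible_interval(60, 138))  # without AGENTS.md -> roughly (0.35, 0.52)
print(credible_interval(66, 138))  # with AGENTS.md    -> roughly (0.40, 0.56)
# Heavily overlapping intervals would suggest the gap could easily be noise.
```

(One could go a step further and sample the difference between the two posteriors, but even the overlapping intervals make the point.)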
I’ve only ever gotten, like, slight wording suggestions from reviewers. I wish they would write things like this instead—it is possibly meaningful and eminently do-able (doesn’t even require new data!).
> Regarding the 4% improvement for human-written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.
Ok, so that's interesting in itself. Apologies if you go into this in the paper - I've not had time to read it yet - but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.
It could... but as pointed out by others, the significance is unclear, and the per-model results have even fewer samples than the benchmark average. So: maybe :)
Thank you for turning up here and replying!
> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
I think the LLM-generated AGENTS.md files recommended by coding agents are almost without exception really bad, because an AGENTS.md, to perform well, needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including from certain vendors who at one point included automatic AGENTS.md generation out of the box - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.
I believe this is because creating a good one consumes a massive amount of resources and some engineering for any non-trivial codebase. You'd need multiple full-context iterations, and a large number of thinking tokens.
On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo. Things like "Is this intentional?", "Why is this the case?" and so on. Obviously, neither the LLM nor a new-to-the-project human could know these things or add them to the file. And the gains from this are also hard to capture with your performance metric, because they're not really about solving issues; they're often about direction, or about the how rather than the what.
As for the extra tokens, the right AGENTS.md can save lots of tokens, but it requires thinking hard about them. Which system/business logic would take the agent 5 different file reads to properly understand, but could be summarized in 3 sentences?
Yes that's a great summary and I agree broadly.
Note that by different prompt types I mean different types of meta-prompts used to generate the AGENTS.md. All of these are quite useless. Some additional experiments not in the paper showed that other automated approaches are also useless ("memory"-creating methods, broadly speaking).
In Theory There Is No Difference Between Theory and Practice, While In Practice There Is.
In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.
> The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency.
This. I have Claude write about the codebase because I get tired of it grepping files constantly. I'd rather it just know “these files are for x, these files have y methods”, and I even have it break down larger files so it fits the entire context window several times over.
Funnily enough this makes it easier for humans to parse.
My pet peeve with AI is that it tends to work better in codebases where humans do well, and for the same reasons.
Large orchestration package without any tests that relies on a bunch of microservices to work? Claude Code will be as confused as our SDEs.
This in turn leads to a broader effort to refactor our antiquated packages in the name of "making it compatible with AI", which actually means compatible with humans.
In my opinion it’s not just compatible with AI, it’s code that now fits in your head. Lots of famous “we can rewrite it later” remarks throughout my career… Well, the AI can rewrite it, and now you can understand it.
Always make it write out a plan, and write out unit tests that match the codebase as-is; if adjusted later, the tests should only change in how they call the code, giving you confidence that the rewrite didn't break core logic.
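A minimal sketch of what such an as-is test might look like, with a hypothetical `legacy_pricing.total_with_discounts` function standing in for whatever is being rewritten (all names invented):

```python
# Hypothetical characterization test: it pins down the *current* behaviour of a
# legacy function before any AI-driven rewrite. Names are invented for illustration.
import pytest

from legacy_pricing import total_with_discounts  # hypothetical module under rewrite

@pytest.mark.parametrize(
    "items, coupon, expected",
    [
        ([("widget", 2, 9.99)], None, 19.98),       # no discount
        ([("widget", 2, 9.99)], "TEN_OFF", 17.98),  # flat coupon
        ([], None, 0.0),                            # empty cart edge case
    ],
)
def test_total_with_discounts_matches_current_behaviour(items, coupon, expected):
    # Expected values are whatever the legacy code returns today, warts and all.
    assert total_with_discounts(items, coupon) == pytest.approx(expected)
```

If a later rewrite only changes how these tests import or call the code while the expected values stay green, that's reasonable evidence the core logic survived.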
Why is that a pet peeve, though? Seems like a win/win.
This reads a lot like the bargaining stage. If agentic AI makes me a 10 times more productive developer, surely a 4% improvement is barely worth the token cost.
> If agentic AI makes me a 10 times more productive
I'm not sure what you are suggesting exactly, but wanted to highlight this humongous "if".
It's not only about the token cost! It's also my TIME cost! Much-much more expensive than tokens, it turns out ;)
If something makes you 10x as effective and then you improve that thing by 4%...
Is that 10x in quantity or quality?
Also, "perceived" or "real"?
Honestly, the more research papers I read, the more suspicious I am. This "surprisingly" and other hyperbole is just there to make reviewers think the authors actually did something interesting/exciting. But the more "surprises" there are in a paper, the more suspicious I am of it. Such hyperbole ought at best to be ignored; at worst, the exact opposite needs to be examined.
It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.
Well, you still need time (and permission from your employer)! Research is usually a more than full time job on its own.
4% is yuuuge. In hard projects, 1% is the difference between getting it right with an elegant design or going completely off the rails.