Comment by pamelafox
6 days ago
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
You can also save time/tokens: if you see that every request starts by looking for the same information, you can front-load it.
Also, take the randomness out of it. Sometimes the agent executes tests one way, sometimes another.
I've found https://github.com/casey/just to be very, very useful. It lets you bind common commands to simple, shorter ones that can be easily referenced. Good for humans too.
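For example, a minimal justfile might look like this (the recipe names and commands are hypothetical placeholders, not from any particular repo; swap in whatever your project actually runs):

```just
# justfile — hypothetical recipes; replace with your repo's real commands
test:
    pytest tests/

lint:
    ruff check .

serve:
    python -m http.server 8000
```

Then the agent (or a new contributor) only needs `just test` or `just lint` instead of rediscovering the exact invocation every time.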
Don't forget to update it regularly, then.
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt, so you can't really be certain that a difference in output comes from the single variable you changed.
So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure the results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents.
Same with people: no matter what info you give a person, you can't be sure they'll follow it the same way every time.
Agree. I've also found that a rule-discovery approach like this performs better. It's like teaching a student: they've probably already performed well on some tasks, and if we feed them an extra rule for something they're already well versed in, it can hinder their creativity.