Comment by vidarh
5 days ago
Without measuring quality of output, this seems irrelevant to me.
My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.
Performance is not a consideration.
If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents'.
CLAUDE.md isn't a silver bullet either; I've had it lose context a couple of questions deep. I do like GSD[1] though, it's been a great addition to the stack. I also use multiple different LLMs as judges for PRs, which catches a load of issues too.
[1] https://github.com/gsd-build/get-shit-done
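To give a rough sense of the multi-LLM judge idea: a minimal sketch where each judge is just a callable mapping a prompt to a reply. The `judge_pr` helper and the model names are hypothetical stand-ins, not tied to GSD or any specific vendor API.

    # Minimal sketch: send the same PR diff to several independent LLM
    # judges and collect each one's findings. `judges` maps a name to any
    # callable that takes a prompt string and returns the model's reply;
    # the names and helpers here are hypothetical, not a real vendor API.
    from typing import Callable

    PROMPT = (
        "Review this PR diff and list concrete problems only "
        "(bugs, missing tests, risky refactors). Reply NONE if clean.\n\n{diff}"
    )

    def judge_pr(diff: str, judges: dict[str, Callable[[str], str]]) -> dict[str, str]:
        # Same prompt to every judge, findings keyed by judge name.
        prompt = PROMPT.format(diff=diff)
        return {name: ask(prompt) for name, ask in judges.items()}

    # Usage: plug in clients from different model families so their blind
    # spots don't overlap, then keep only the judges that flagged something.
    # findings = judge_pr(diff_text, {"model_a": ask_a, "model_b": ask_b})
    # issues = {k: v for k, v in findings.items() if v.strip() != "NONE"}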
In this context, "performance" means "does it do what we want it to do," not "does it do it quickly." Quality of output is what they're measuring; speed is not a consideration.
The point is that whether it does what you tell it in a single iteration is less important than whether it avoids stupid mistakes. Any serious use will put it in a harness.
My point is that you misread the comment you replied to. (By the way, on page 2 of the paper: "we evaluate each LLM only within its corresponding harness.")