Honestly, it's empirical. We started with what was easiest to measure: human correction rate. If I had to step in and fix something, that's a clear signal the agent took a bad path.
Iteration counts and reverts turned out to be noisier -- sometimes a high iteration count means the task was genuinely hard, not that the agent made a mistake. So we downweighted those.
The meta-answer is: pick the metric that most directly captures "I wish the agent hadn't done that." For us that's human intervention. For a team with better test coverage, it might be test failures after commit. For infra work, maybe rollback frequency.
There's no universal loss function -- it depends on where your pain actually is. We just made our metric explicit and started logging it. The logging alone forced clarity.
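To make "make it explicit and log it" concrete, here's a rough sketch of the shape this can take. The weights, field names, and the `log_outcome` helper are all made up for illustration, not our actual setup; the point is just that writing the loss function down forces you to choose one.

```python
# Hypothetical sketch: the weights and names here are illustrative, not a real system.
import csv
import time
from dataclasses import dataclass

@dataclass
class RegretWeights:
    # Human intervention is the primary "I wish the agent hadn't done that" signal.
    human_intervention: float = 1.0
    # Iterations and reverts are noisier (hard task vs. bad path), so downweight them.
    extra_iteration: float = 0.1
    revert: float = 0.3

def regret_score(interventions: int, iterations: int, reverts: int,
                 w: RegretWeights = RegretWeights()) -> float:
    """Weighted sum of the signals you actually care about."""
    extra_iters = max(iterations - 1, 0)  # the first iteration is free
    return (w.human_intervention * interventions
            + w.extra_iteration * extra_iters
            + w.revert * reverts)

def log_outcome(path: str, task_id: str, interventions: int,
                iterations: int, reverts: int) -> None:
    """Append one row per completed agent task; the log itself forces clarity."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            time.strftime("%Y-%m-%dT%H:%M:%S"),
            task_id, interventions, iterations, reverts,
            round(regret_score(interventions, iterations, reverts), 3),
        ])

# Example: one task that needed a human fix and a revert.
log_outcome("agent_outcomes.csv", "refactor-auth-module",
            interventions=1, iterations=4, reverts=1)
```

Swap the weighted terms for whatever your own pain metric is (post-commit test failures, rollback frequency); the structure stays the same.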