Comment by behat

18 hours ago

This is a very interesting read on failure modes of AI agents in prod.

Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.

Curious what helped catch in the later eval vs. initial ones. Was it that the initial testing was online A/B comparison of aggregate metrics, or that the dataset was not broad enough?