Comment by devonkelley

5 hours ago

Running agents in production, I've stopped trying to figure out why things degrade. The answer changes weekly.

Model drift, provider load, API changes, tool failures: the cause doesn't matter. What matters is that yesterday's 95% success rate is today's 70%, and by the time you notice, debug, and ship a fix, something else has shifted.

The real question isn't "is the model degraded?" It's "what should my agent do right now given current conditions?"

We ended up building systems that canary multiple execution paths continuously and route traffic based on what's actually working. When Claude degrades, traffic shifts to the backup path automatically. No alerts, no dashboards, no incident.
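The source doesn't show its implementation, but the routing idea can be sketched as an epsilon-greedy router over execution paths: a small canary fraction of traffic continuously samples every path to keep stats fresh, and the rest goes to whichever path currently has the best rolling success rate. All names (`PathRouter`, `canary_rate`, the window size) are illustrative assumptions, not the author's actual system.

```python
import random
from collections import deque

class PathRouter:
    """Hypothetical sketch: route most traffic to the healthiest
    execution path, while continuously canarying all paths."""

    def __init__(self, paths, window=100, canary_rate=0.05):
        self.paths = list(paths)
        self.canary_rate = canary_rate
        # Rolling window of recent outcomes (True = success) per path.
        self.history = {p: deque(maxlen=window) for p in self.paths}

    def success_rate(self, path):
        h = self.history[path]
        # Optimistic prior so untested paths still get tried.
        return sum(h) / len(h) if h else 1.0

    def choose(self):
        # A small slice of traffic canaries a random path...
        if random.random() < self.canary_rate:
            return random.choice(self.paths)
        # ...the rest goes to whatever is currently working best.
        return max(self.paths, key=self.success_rate)

    def record(self, path, success):
        self.history[path].append(success)


# Usage: when the primary path degrades, traffic shifts on its own.
router = PathRouter(["claude_primary", "backup_path"])
for _ in range(50):
    path = router.choose()
    # result = run_agent_step(path)  # hypothetical execution
    router.record("claude_primary", False)  # simulate degradation
    router.record("backup_path", True)
```

The point of the canary fraction is that the router never fully abandons a path: if the primary recovers, its rolling window refills with successes and traffic shifts back without anyone touching a dashboard.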

Treating this as a measurement problem assumes humans will act on the data. At scale, that assumption breaks.