Comment by shubhamintech

5 hours ago

The test harness point is spot on but there's a gap worth naming: the failure modes you write evals for aren't the ones that cause users to churn. Prod conversations have a whole category where the agent doesn't error, it just confidently goes sideways in a way nobody wrote a test for. The teams actually retaining users from AI products are reading conversations, not just dashboards.

0 comments