There are some days where it acts staggeringly bad, beyond baselines.
But it’s impossible to actually determine whether it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system-prompt and tool changes, fine-tunes and A/B tests, or variance from top-p sampling…
There are too many variables and no hard evidence shared by Anthropic.
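On the top-p point: a minimal sketch of nucleus (top-p) sampling, with a made-up three-token distribution, showing why identical prompts can still yield different outputs even when nothing else changes:

```python
import random

def top_p_sample(probs, p=0.9, rng=random):
    """Sample a token via nucleus (top-p) sampling from a dict of
    token -> probability. Illustrative only; real inference stacks
    work on logits over full vocabularies."""
    # Rank tokens by probability and keep the smallest prefix whose
    # cumulative mass reaches p (the "nucleus").
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    # Renormalize over the nucleus and draw one token at random.
    total = sum(pr for _, pr in nucleus)
    r = rng.random() * total
    acc = 0.0
    for tok, pr in nucleus:
        acc += pr
        if r <= acc:
            return tok
    return nucleus[-1][0]

# Hypothetical next-token distribution: with p=0.9 all three tokens
# survive the cutoff, so repeated calls can return different tokens.
probs = {"good": 0.5, "fine": 0.3, "bad": 0.2}
print({top_p_sample(probs, p=0.9) for _ in range(20)})
```

With a tighter cutoff (e.g. p=0.5 here) the nucleus collapses to the single most likely token and output becomes deterministic, which is why the p a provider picks, and any silent change to it, shows up as variance you can’t see from the outside.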
I dunno about everyone else, but when I learn more about what a model is and isn’t useful for, my subjective experience improves, not degrades.
Not when the product is marketed as a panacea.
No, because switching to the API with the same prompt immediately fixes it.
There's little incentive to throttle the API. It's $/token.