Comment by nessex
14 hours ago
They would quantize the model. That would make it cheaper to run at the cost of slightly worse output, but it would still generate outputs with a similar feel, derived from a compressed version of the same knowledge base.
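To make "quantize" concrete, here is a minimal sketch of symmetric per-tensor quantization on a toy list of weights. The helper names (`quantize`, `dequantize`, `mean_abs_error`) and the random Gaussian "weights" are my own illustration, not anything a provider is known to run; real inference stacks use more elaborate schemes (per-channel or per-group scales, calibration data and so on).

```python
import random

def quantize(weights, bits=8):
    """Symmetric per-tensor quantization: map floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy stand-in for a weight tensor (assumption: roughly Gaussian values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    err = mean_abs_error(weights, dequantize(q, scale))
    print(f"{bits}-bit: mean abs error = {err:.4f}")
```

Fewer bits means less memory and bandwidth per weight but a coarser grid, so the reconstruction error grows. That is the "cheaper but slightly worse" trade-off in miniature.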
They wouldn't even need to do this uniformly: only a subset of requests could be routed to quantized versions of the model. They could do this to nerf the old model, or more likely just to free up hardware for the new one by handling more requests on less hardware, or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.
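Purely as a hypothetical, non-uniform routing could be as simple as the sketch below. The function `pick_variant`, the variant labels and the 20% fraction are all made up for illustration; I have no knowledge of how any provider actually routes requests.

```python
import hashlib

def pick_variant(request_id: str, quantized_fraction: float) -> str:
    """Deterministically send a fraction of requests to a cheaper variant."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "quantized" if bucket < quantized_fraction else "full"

# Hypothetical traffic: 10,000 request IDs, 20% routed to the quantized variant.
counts = {"full": 0, "quantized": 0}
for i in range(10_000):
    counts[pick_variant(f"req-{i}", quantized_fraction=0.2)] += 1
print(counts)
```

From the caller's side nothing distinguishes the two paths, and the fraction could be raised or lowered with load without any visible change to the API.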
Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.
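A rough simulation of why this only shows up in aggregate. The failure rates are invented numbers (2% bad responses for a full model, 4% for a quantized one), and `failure_rate` is just a helper for this sketch.

```python
import random

def failure_rate(p_fail: float, n: int, rng: random.Random) -> float:
    """Observed fraction of bad responses in n independent samples."""
    return sum(rng.random() < p_fail for _ in range(n)) / n

# Assumed, made-up probabilities of a visibly bad response.
P_FULL, P_QUANT = 0.02, 0.04

rng = random.Random(42)
for n in (10, 100, 10_000):
    full = failure_rate(P_FULL, n, rng)
    quant = failure_rate(P_QUANT, n, rng)
    print(f"n={n:>6}: full={full:.3f} quantized={quant:.3f}")
```

With a handful of samples the two observed rates overlap heavily, so one bad response says almost nothing about which variant produced it. The gap only becomes reliable once the sample count is large.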
I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
I have had the same experiences you've had with 4.6, and only since they brought out 4.7. It's fairly obvious they're doing something like what you've described here.
Forgot to mention, but it was after the 4.7 release when I was still using 4.6 that I saw those loops too... Before that, 4.6 had been a pretty seamless experience.
And guess what all the providers of open models do: They quantize, badly.
This is why you pay a premium for trusted providers that are verified not to quantize.