Comment by maybe_pablo
15 hours ago
Weight quantization, n-expert capping, routing to smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding and probably more.
15 hours ago
Weight quantization, n-expert capping, routing to smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding and probably more.
I can't prove any of it, but it sure feels like that happens sometimes on Anthropic's platform.
I don't seem to get any of this with GPT-5.5 or GPT-5.5-Pro (not that I use 5.5-Pro enough to know for sure, but when I do use it, it never seems nerfed).
I'm pretty sure you could do n-expert capping on any MoE model with only a handful lines of changes to ik_llama.cpp, but yeah... my bet is the have various quantisations and run the lower ones at peak (along with different system prompts i.e we're GPU-bound right now. Get to the point with less chatter)