Comment by maybe_pablo

15 hours ago

Weight quantization, n-expert capping, routing to a smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding, and probably more.

2 comments

trollbridge 10 hours ago

I can't prove any of it, but it sure feels like that happens sometimes on Anthropic's platform.

I don't seem to get any of this with GPT-5.5 or GPT-5.5-Pro (not that I use 5.5-Pro enough to know for sure, but when I do use it, it never seems nerfed).

alfiedotwtf 13 hours ago

I'm pretty sure you could do n-expert capping on any MoE model with only a handful of lines changed in ik_llama.cpp, but yeah... my bet is they have various quantisations and run the lower ones at peak (along with different system prompts, e.g. "we're GPU-bound right now, get to the point with less chatter").
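
To make the capping idea concrete, here's a rough Python sketch of top-k MoE routing where k can be lowered at inference time. This is not ik_llama.cpp code and not a claim about what any provider does; `route_top_k`, `moe_forward`, the toy experts and the logits are all made up for illustration, and I'm assuming the gate weights are a softmax over just the selected experts (real models differ on that detail).

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over only the selected logits so the weights sum to 1
    # (an assumption; some models normalize over all experts first).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy setup: 4 hypothetical "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 1.5, -1.0]

full = route_top_k(logits, k=2)    # the k the model was (hypothetically) trained with
capped = route_top_k(logits, k=1)  # capped: only the single best expert runs
```

The only thing that changes between the two calls is `k`, which is why it looks like a small patch: fewer experts run per token, so less compute, at the cost of output that drifts from what the model was trained to produce.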