In MoE models the experts are often quantized, while the shared portions, which are a much smaller part of the network but have greater impact, are kept at higher or full precision. I'm not familiar with the Kimi QAT approach specifically, but it's likely they do this.
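To make the idea concrete, here is a rough sketch of that kind of mixed-precision split. Everything in it is an assumption for illustration: the layer names, the parameter counts, the 4-bit/16-bit choice, and the simple per-tensor symmetric quantizer are made up and are not taken from Kimi or any specific model.

```python
# Hypothetical parameter counts for a toy MoE model (illustrative only).
PARAM_COUNTS = {
    "attention.qkv": 50_000_000,
    "attention.out": 20_000_000,
    "shared_expert.ffn": 30_000_000,
    "router.gate": 1_000_000,
    "experts.ffn": 900_000_000,  # all routed experts combined
}

EXPERT_BITS = 4    # assumed low-bit format for routed experts
SHARED_BITS = 16   # assumed higher precision for everything else


def bits_for(name: str) -> int:
    """Routed experts get the low-bit format; shared parts stay at 16 bits."""
    return EXPERT_BITS if name.startswith("experts.") else SHARED_BITS


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so every code fits in the signed range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def average_bits(param_counts):
    """Parameter-weighted average bits per weight across the whole model."""
    total = sum(param_counts.values())
    weighted = sum(n * bits_for(name) for name, n in param_counts.items())
    return weighted / total


if __name__ == "__main__":
    print(f"average bits per weight: {average_bits(PARAM_COUNTS):.2f}")
    codes, scale = quantize_symmetric([0.7, -0.2, 0.0, 0.1], EXPERT_BITS)
    print(codes, dequantize(codes, scale))
```

With these made-up counts the routed experts hold most of the parameters, so the average bits per weight ends up close to the expert bit width even though everything else stays at 16 bits. That is why quantizing only the experts still gets most of the memory savings.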