Comment by wgd
3 days ago
In MoE models the experts are often quantized, while the shared portions, being a much smaller part of the network with greater impact, are kept at higher or full precision. I'm not familiar with the Kimi QAT approach specifically, but it's likely they do this.
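To make that concrete, here is a rough sketch of the general idea, not Kimi's actual scheme. The parameter names, the `.experts.` naming convention, and the choice of symmetric per-tensor int4 are all assumptions made up for illustration.

```python
def quantize_int4(weights):
    """Symmetric per-tensor fake-quantization onto a 4-bit grid in [-8, 7]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 7.0
    # Round to the nearest integer level, clamp to the int4 range, then dequantize.
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


def quantize_moe(params, expert_marker=".experts."):
    """Quantize tensors whose name marks them as experts; copy the rest unchanged."""
    out = {}
    for name, weights in params.items():
        if expert_marker in name:
            out[name] = quantize_int4(weights)
        else:
            out[name] = list(weights)  # shared parts stay at original precision
    return out


# Hypothetical parameter names and values, for illustration only.
params = {
    "layers.0.attention.wq": [0.11, -0.52, 0.33],
    "layers.0.mlp.experts.0.w1": [0.7, -0.33, 0.12],
    "layers.0.mlp.shared_expert.w1": [0.25, -0.4, 0.05],
}
quantized = quantize_moe(params)
```

Only the tensor under `.experts.` gets snapped to the 4-bit grid here; the attention and shared-expert tensors pass through untouched. A real setup would typically use per-group scales and packed integer storage rather than per-tensor fake-quantization.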