Comment by diyer22
3 days ago
I agree with @ActivePattern, and thank you for helping to answer this.
To supplement @f_devd's point:
During training, the K outputs are all generated from a shared stem feature produced by the NN blocks, so producing all K of them adds only a small amount of extra computation. After the L2-distance sampling step selects the best candidate, discarding the other K-1 outputs is therefore nearly free; this is not comparable to discarding K-1 MoE experts, which would be very expensive.
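A minimal sketch of that idea (not the actual implementation; the module structure, 1x1-conv heads, and names like K and stem_channels are illustrative assumptions): K lightweight heads read the same stem feature, the candidate closest to the ground truth under L2 distance is kept for the loss, and the rest are simply dropped.

```python
# Illustrative sketch only: shared stem feature -> K cheap heads -> pick nearest by L2.
import torch
import torch.nn as nn


class MultiOutputHead(nn.Module):
    def __init__(self, stem_channels: int, out_channels: int, k: int):
        super().__init__()
        # K lightweight heads on top of one shared stem feature;
        # each head is just a 1x1 conv, so the extra compute per candidate is small.
        self.heads = nn.ModuleList(
            nn.Conv2d(stem_channels, out_channels, kernel_size=1) for _ in range(k)
        )

    def forward(self, stem_feat: torch.Tensor, target: torch.Tensor):
        # Generate K candidate outputs from the same stem feature: (B, K, C, H, W).
        candidates = torch.stack([head(stem_feat) for head in self.heads], dim=1)

        # Mean squared (L2) distance between each candidate and the ground truth: (B, K).
        dists = ((candidates - target.unsqueeze(1)) ** 2).flatten(2).mean(dim=2)

        # Keep only the nearest candidate; the other K-1 outputs are discarded,
        # which is cheap because only the tiny heads were run K times.
        best = dists.argmin(dim=1)
        chosen = candidates[torch.arange(candidates.size(0)), best]
        loss = ((chosen - target) ** 2).mean()
        return chosen, loss
```

The point of the sketch is that the expensive backbone runs once per sample; only the small per-candidate heads are duplicated K times, so discarding K-1 candidates wastes very little compute.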