Comment by bick_nyers
2 months ago
Or merge the bottom 1/8 (or whatever fraction) of the experts together and (optionally) do some minimal training with all other weights frozen. You'd need to modify the MoE routers slightly to map old -> new expert indices, so you don't have to retrain the routers.
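A rough sketch of what that could look like, in plain Python with flat weight lists standing in for real expert tensors. Everything here is an assumption for illustration: "bottom" is taken to mean least-routed-to, the merge is a usage-weighted average, and the names `merge_least_used_experts` and `remap_routing` are hypothetical, not from any library. The optional fine-tuning step is not shown.

```python
def merge_least_used_experts(experts, usage, fraction=0.125):
    """Merge the least-used `fraction` of experts into one averaged expert.

    experts: list of flat weight lists, one per expert, all the same length
    usage:   list of routing counts, one per expert (assumed ranking signal)
    Returns (new_experts, old_to_new), where old_to_new[i] is the new index
    of old expert i.
    """
    n = len(experts)
    n_merge = max(2, int(n * fraction))  # merging needs at least two experts
    order = sorted(range(n), key=lambda i: usage[i])
    merged_ids = set(order[:n_merge])
    kept_ids = [i for i in range(n) if i not in merged_ids]

    # Usage-weighted average; fall back to a uniform average if none were used.
    total = sum(usage[i] for i in merged_ids)
    weights = {
        i: (usage[i] / total if total > 0 else 1.0 / n_merge) for i in merged_ids
    }
    dim = len(experts[0])
    merged = [sum(weights[i] * experts[i][d] for i in merged_ids) for d in range(dim)]

    # Kept experts are renumbered densely; the merged expert goes last.
    new_experts = [experts[i] for i in kept_ids] + [merged]
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    for i in merged_ids:
        old_to_new[i] = len(kept_ids)
    return new_experts, old_to_new


def remap_routing(top_k, old_to_new):
    """Translate the untouched router's (old_index, gate) pairs to new indices.

    If several selected old experts now share one merged expert, their gate
    weights are summed so the total gate mass is unchanged.
    """
    combined = {}
    for old_idx, gate in top_k:
        new_idx = old_to_new[old_idx]
        combined[new_idx] = combined.get(new_idx, 0.0) + gate
    return sorted(combined.items())
```

The router itself keeps producing scores over the old expert indices; only the lookup after top-k selection changes, which is why the router weights wouldn't need retraining in this setup.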