Comment by boroboro4
7 months ago
Check out the DeepSeek-V3 paper. They changed the way they train the experts (moving from an aux loss to a different kind of expert-separation training). It did improve the experts' domain specialization, and they have neat graphics on it in the paper.
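
For anyone who wants a feel for what "no aux loss" can look like mechanically, here is a rough toy sketch, assuming the bias-based balancing the paper describes: each expert gets a bias that is added to its routing score only when picking the top-k experts, and that bias is nudged down when the expert is overloaded and up when it is underloaded. Everything concrete here is my own assumption for illustration (the function names `route_top_k`, `update_bias`, and `simulate`, the skewed random scores, the value of `gamma`); it is not the paper's code.

```python
import random


def route_top_k(scores, bias, k):
    """Pick k experts per token by (score + bias). The bias only affects selection."""
    choices = []
    for token_scores in scores:
        ranked = sorted(
            range(len(bias)),
            key=lambda e: token_scores[e] + bias[e],
            reverse=True,
        )
        choices.append(ranked[:k])
    return choices


def update_bias(bias, choices, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = [0] * len(bias)
    for chosen in choices:
        for e in chosen:
            load[e] += 1
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias, load


def simulate(num_experts=4, num_tokens=256, k=1, gamma=0.05, steps=200, seed=0):
    """Toy run: raw scores are skewed so higher-numbered experts look better."""
    rng = random.Random(seed)
    bias = [0.0] * num_experts
    skew = [0.5 * e for e in range(num_experts)]  # hypothetical imbalance
    first_load = None
    load = None
    for _ in range(steps):
        scores = [
            [rng.random() + skew[e] for e in range(num_experts)]
            for _ in range(num_tokens)
        ]
        choices = route_top_k(scores, bias, k)
        bias, load = update_bias(bias, choices, gamma)
        if first_load is None:
            first_load = load
    return first_load, load, bias


if __name__ == "__main__":
    before, after, final_bias = simulate()
    print("load at first step:", before)
    print("load at last step: ", after)
    print("final bias:        ", final_bias)
```

In this toy setup the first step sends almost every token to the highest-scoring expert, and the bias updates spread the load out over time without any balancing term in a loss. The real model still uses the unbiased score to weight the expert outputs, which this sketch leaves out.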