Comment by ianbutler
14 days ago
This is a common misunderstanding. Experts are learned during training, together with gating networks that route dynamically per token in each layer. As a slightly lossy example, you might have an expert on the word "apple" in one layer.
At inference time, queries are then routed dynamically in the same way.
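To make the routing concrete, here is a minimal sketch of top-k gating for a single token in a single layer. It assumes a linear gate followed by a softmax over the selected logits, which is one common design but not the only one. The names `route_token` and `moe_layer`, the toy experts, and the hand-picked gate weights are all hypothetical; in a real model the gate weights and experts are learned jointly during training.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_vec, gate_weights, k=2):
    """Pick the top-k experts for one token.

    gate_weights holds one row per expert (hypothetical, hand-picked here).
    Returns a list of (expert_index, mixing_weight) pairs.
    """
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in gate_weights]
    # Keep only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts so the mixing weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token_vec, gate_weights, experts, k=2):
    """Run only the selected experts and mix their outputs."""
    routes = route_token(token_vec, gate_weights, k)
    out = [0.0] * len(token_vec)
    for idx, weight in routes:
        expert_out = experts[idx](token_vec)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, routes


# Toy setup: three "experts" that are just simple functions of the input.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [-x for x in v],
    lambda v: [0.0 for _ in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

if __name__ == "__main__":
    for token in ([2.0, 0.5], [0.5, 2.0]):
        out, routes = moe_layer(token, gate_weights, experts, k=2)
        print(token, "->", routes, out)
```

Two different tokens passing through the same layer can be sent to different experts, and the same decision is made again, independently, in every MoE layer. That is why an "expert" ends up closer to a specialist on certain tokens in a certain layer than to a specialist on a whole topic.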