Comment by tomp
14 days ago
> individual tokens are routed to different experts
AFAIK (not an expert! lol) that was the traditional approach.
But judging by the chart in the Llama 4 blog post, they're now interleaving MoE layers with dense attention layers, so I guess this means even a single token could be routed to a different expert at every single MoE layer!
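To make the "different expert at every MoE layer" idea concrete, here's a toy sketch of how I picture it. It's pure Python with random weights and top-1 routing. Everything in it (`build_layers`, `route_top1`, `trace_token`, the sizes) is made up for illustration and isn't taken from the Llama 4 code, and I left out the dense attention layers that would sit between the MoE layers.

```python
import random


def make_matrix(rows, cols, rng, scale=1.0):
    # Random weight matrix stored as a list of rows.
    return [[rng.gauss(0, 1) * scale for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    # Plain matrix-vector product.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_layers(num_layers, num_experts, dim, seed=0):
    # Each hypothetical MoE layer has its OWN router and its OWN experts.
    rng = random.Random(seed)
    layers = []
    for _ in range(num_layers):
        router = make_matrix(num_experts, dim, rng)
        experts = [make_matrix(dim, dim, rng, scale=dim ** -0.5)
                   for _ in range(num_experts)]
        layers.append((router, experts))
    return layers


def route_top1(router, hidden):
    # Score every expert against the token's hidden state, keep the best one.
    scores = matvec(router, hidden)
    return max(range(len(scores)), key=scores.__getitem__)


def trace_token(hidden, layers):
    # Follow one token through the stack and record the expert picked per layer.
    path = []
    for router, experts in layers:
        expert_index = route_top1(router, hidden)
        path.append(expert_index)
        expert_out = matvec(experts[expert_index], hidden)
        # Residual connection: the hidden state changes, so the next layer's
        # router sees a different input and makes its own decision.
        hidden = [h + o for h, o in zip(hidden, expert_out)]
    return path


if __name__ == "__main__":
    layers = build_layers(num_layers=4, num_experts=8, dim=16)
    rng = random.Random(1)
    token = [rng.gauss(0, 1) for _ in range(16)]
    print(trace_token(token, layers))  # one expert index per MoE layer
```

The routing decision is made per token and per layer, so nothing forces the index printed for layer 1 to match the one for layer 2.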