Comment by tomp

14 days ago

> individual tokens are routed to different experts

That was, AFAIK (not an expert! lol), the traditional approach.

But judging by the chart in the Llama 4 blog post, they're now interleaving MoE layers with dense attention layers, so I guess this means that even a single token could be routed to different experts at every single MoE layer!
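
To make the per-layer routing idea concrete, here's a toy sketch (not Llama 4's actual code; `route_token`, `trace_token`, and the router matrices are hypothetical names made up for illustration). It assumes each MoE layer has its own linear router that scores the experts and picks the top-k, and it keeps the token vector fixed between layers for simplicity, whereas in a real model the hidden state would change after each layer.

```python
def route_token(token_vec, router_weights, top_k=1):
    """Score every expert with a dot product and return the top_k expert indices."""
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def trace_token(token_vec, routers, top_k=1):
    """Return the experts chosen for one token at each MoE layer.

    `routers` holds one weight matrix per MoE layer (assumption: routers are
    independent per layer, so choices at one layer don't constrain the next).
    """
    return [route_token(token_vec, layer_router, top_k) for layer_router in routers]


# Two MoE layers, two experts each, 2-dimensional token (toy numbers).
token = [1.0, 0.0]
routers = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0: expert 0 scores highest for this token
    [[0.0, 1.0], [1.0, 0.0]],  # layer 1: expert 1 scores highest for this token
]
path = trace_token(token, routers)
print(path)  # [[0], [1]]
```

So the same token lands on expert 0 in the first MoE layer and expert 1 in the second, which is the "different experts at every MoE layer" behaviour described above.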