Comment by idonotknowwhy
12 days ago
> It's not even per token. The routing happens once per layer, with the same token bouncing between layers.
They don't really "bounce around" though do they (during inference)? That implies the token could bounce back from eg. layer 4 -> layer 3 -> back to layer 4.
No comments yet
Contribute on Hacker News ↗