Comment by jimmyl02

14 days ago

the most unintuitive part is that, from my understanding, individual tokens are routed to different experts. this is hard to square with the word "experts", since it means two sequential tokens can be handled by two different experts, right?

I think where MoE is misleading is that the experts aren't what we would call "experts" in the normal sense; rather, they are experts for a specific token. that concept feels difficult to grasp.

It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

It's more of a performance optimization than anything else, improving memory liquidity. Except it's not an optimization for running the model locally (where you only run a single query at a time, and it would be nice to keep the weights on the disk until they are relevant).

It's a performance optimization for large deployments with thousands of GPUs answering tens of thousands of queries per second. They put thousands of queries into a single batch and run them in parallel. After each layer, the queries are re-routed to the GPU holding the correct subset of weights. Individual queries will bounce across dozens of GPUs per token, distributing load.

Even though the name "expert" implies they should be experts in a given topic, that's really not true. During training, they optimize for making the load distribute evenly, nothing else.
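
To make that concrete, here's a toy sketch of a per-layer top-k router plus a Switch-style load-balancing auxiliary loss (my own illustration in PyTorch, not any particular model's code; the top-1 dispatch count is a simplification):

```python
# Toy MoE layer: each token gets its own routing decision at this layer,
# and training adds a loss term that only cares about spreading load evenly.
import torch
import torch.nn.functional as F

def moe_layer(x, router_w, experts, k=2):
    # x: [tokens, d_model], router_w: [d_model, n_experts], experts: list of callables
    probs = F.softmax(x @ router_w, dim=-1)        # routing decision per token, per layer
    topk_p, topk_i = probs.topk(k, dim=-1)         # each token picks its own k experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        hit = topk_i == e                          # [tokens, k]: did this token pick expert e?
        rows = hit.any(dim=-1)
        if rows.any():
            weight = (topk_p * hit).sum(dim=-1, keepdim=True)[rows]
            out[rows] += weight * expert(x[rows])  # only the chosen experts actually run
    # Auxiliary loss in the style of Switch Transformer: fraction of tokens dispatched
    # to each expert times its mean router probability -- minimized when load is even.
    dispatch = F.one_hot(topk_i[:, 0], num_classes=len(experts)).float().mean(dim=0)
    importance = probs.mean(dim=0)
    aux_loss = len(experts) * (dispatch * importance).sum()
    return out, aux_loss

# e.g. 8 small FFN "experts" over a 16-dim hidden state
experts = [torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.GELU(), torch.nn.Linear(32, 16))
           for _ in range(8)]
out, aux = moe_layer(torch.randn(5, 16), torch.randn(16, 8), experts)
```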

  • BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices.

    While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you from tuning them to switch experts only once or twice per token, and ideally to keep the same weights across multiple tokens.

    Well, nothing stopping you, but there is the question of whether it would actually produce a worthwhile model.

    • Intuitively it feels like there ought to be significant similarities between experts, because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" and specialising the individual experts as low-rank adaptations on top of that base would let you save a lot of VRAM and expert-swapping (a rough sketch of that structure follows after this sub-thread). But it might mean you need to train with that structure from the start, rather than it being something you can distil down to.

      2 replies →

    • DeepSeek introduced a novel expert-training technique that increased expert specialization. For a given domain, their implementation tends to activate the same experts across different tokens, which is kinda what you're asking for!
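
    A rough sketch of the "shared base expert + low-rank deltas" idea above (a purely hypothetical structure in PyTorch; as far as I know no released model is built exactly like this):

    ```python
    # Hypothetical expert FFN = one shared dense base + a tiny per-expert
    # low-rank delta, so only the small A/B factors would ever need swapping.
    import torch
    import torch.nn as nn

    class LowRankExpertFFN(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, rank=16):
            super().__init__()
            self.base_in = nn.Linear(d_model, d_ff)   # shared across all experts
            self.base_out = nn.Linear(d_ff, d_model)  # shared across all experts
            # per-expert deltas: d_model*rank + rank*d_ff params each, tiny next to the base
            self.A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
            self.B = nn.Parameter(torch.zeros(n_experts, rank, d_ff))

        def forward(self, x, expert_id):
            # x: [tokens, d_model], all assumed routed to the same expert here
            delta = (x @ self.A[expert_id]) @ self.B[expert_id]
            return self.base_out(torch.relu(self.base_in(x) + delta))

    ffn = LowRankExpertFFN()
    y = ffn(torch.randn(4, 512), expert_id=3)
    ```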

  • > It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

    They don't really "bounce around" though, do they (during inference)? That implies the token could bounce back from e.g. layer 4 -> layer 3 -> back to layer 4.

  • > making the load distribute evenly, nothing else.

    so you mean a "load balancer" for neural nets … well, why don't they call it that then?

    • Some load balancers are also routers (if they route based on service capability and not just instantaneous availability), or vice versa, but this kind isn't always one, to my understanding: the experts aren't necessarily "idle" or "busy" at any given time (they're just functions to be invoked, i.e. generally data, not computing resources), but rather more or less likely to answer correctly.

      Even in the single-GPU case, this still saves compute over the non-MoE case (rough numbers below).

      I believe it's also possible to split experts across regions of heterogeneous memory, in which case this task really would be something like load balancing (but still based on "expertise", not instantaneous expert availability, so "router" still seems more correct in that regard.)
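
      Back-of-the-envelope for that compute saving (illustrative shapes, not any specific model's): with 8 experts and top-2 routing, each token only touches a quarter of the stored FFN weights.

      ```python
      # Illustrative numbers only: stored vs. per-token-active FFN parameters.
      d_model, d_ff = 4096, 14336
      n_experts, top_k = 8, 2

      dense_ffn  = 2 * d_model * d_ff          # one up-projection + one down-projection
      moe_stored = n_experts * dense_ffn       # what has to live in (some) memory
      moe_active = top_k * dense_ffn           # what actually multiplies each token

      print(f"stored {moe_stored/1e9:.2f}B, active per token {moe_active/1e9:.2f}B "
            f"({100 * moe_active / moe_stored:.0f}%)")
      ```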

Also note that MoE is a decades-old term, predating deep learning. It's not supposed to be interpreted literally.

> individual tokens are routed to different experts

that was AFAIK (not an expert! lol) the traditional approach

but judging by the chart in the Llama 4 blog post, they're now interleaving MoE layers and dense attention layers; so I guess this means that even a single token could be routed to different experts at every single MoE layer!
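
A toy illustration of that last point (not Llama 4's actual code; a random matrix stands in for each MoE layer's router): the same token gets an independent routing decision at every MoE layer, so its expert can change from layer to layer.

```python
# Each MoE layer has its own router, so a single token's expert choice
# is re-made at every such layer as its hidden state evolves.
import torch

n_moe_layers, n_experts, d_model = 6, 8, 16
routers = [torch.randn(d_model, n_experts) for _ in range(n_moe_layers)]

h = torch.randn(d_model)                   # hidden state of one token
for layer, w in enumerate(routers):
    expert = int((h @ w).argmax())         # top-1 routing choice at this layer
    print(f"MoE layer {layer}: expert {expert}")
    h = h + 0.1 * torch.randn(d_model)     # stand-in for attention/FFN updates between layers
```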

ML folks tend to invent fanciful metaphorical terms for things. Another example is “attention”. I’m expecting to see a paper “consciousness is all you need” where “consciousness” turns out to just be a Laplace transform or something.