
Comment by igravious

14 days ago

> making the load distribute evenly, nothing else.

so you mean a "load balancer" for neural nets … well, why don't they call it that then?

Some load balancers are also routers (if they route based on service capability rather than just instantaneous availability), and vice versa, but as I understand it this kind isn't necessarily one. The experts aren't "idle" or "busy" at any given time: they're just functions to be invoked, i.e. generally data rather than computing resources. What varies between them is how likely each one is to answer a given input correctly.
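A rough sketch of what I mean, in plain Python. The names (`route`, `gate_weights`, `k`) and the toy numbers are made up for illustration and not taken from any particular model. The point is only that the choice depends on learned scores for the input token, and nothing in it tracks whether an expert is busy.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, gate_weights, k=2):
    """Return [(expert_index, weight)] for the k best-scoring experts."""
    # One score per expert: dot product of the token with that expert's gate vector.
    scores = [sum(t * w for t, w in zip(token, row)) for row in gate_weights]
    probs = softmax(scores)
    # Keep the k most probable experts and renormalise their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical gate: 4 experts, 2-dimensional tokens.
gate_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.5, 0.5],
]
token = [2.0, 0.1]
chosen = route(token, gate_weights, k=2)  # experts 0 and 3 score highest here
```

A load balancer in the usual sense would look at queue depth or health checks; here the only input is the token itself.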

Even in the single GPU case, this still saves compute over the non-MoE case.
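Back-of-the-envelope version of that claim, assuming the comparison is against a dense layer with the same total parameter count (i.e. one that evaluates every expert for every token). All numbers below are hypothetical placeholders, not measurements.

```python
def dense_flops(num_experts, flops_per_expert):
    # Dense equivalent: every expert-sized block runs for every token.
    return num_experts * flops_per_expert

def moe_flops(k, flops_per_expert, router_flops):
    # MoE: only the k selected experts run, plus the cost of the router itself.
    return k * flops_per_expert + router_flops

num_experts, k = 8, 2          # hypothetical configuration
per_expert = 1_000_000         # hypothetical FLOPs per expert per token
router = num_experts * 512     # hypothetical: one width-512 dot product per expert

dense = dense_flops(num_experts, per_expert)
sparse = moe_flops(k, per_expert, router)
ratio = sparse / dense         # roughly k / num_experts when the router is cheap
```

None of this depends on how many devices the experts live on, which is why the saving shows up on a single GPU too.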

I believe it's also possible to split experts across regions of heterogeneous memory, in which case the task really would be something like load balancing. Even then the selection is still based on "expertise" rather than instantaneous expert availability, so "router" still seems the more correct term in that regard.