
Comment by popinman322

14 days ago

You can swap experts in and out of VRAM; it just increases inference time substantially.

Depending on the routing function, you can determine all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
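This only works when routing does not depend on intermediate hidden states, e.g. hash-based token-level routing. A minimal sketch under that assumption, where `router`, `load_expert`, and `run_expert` are all hypothetical stand-ins (the sleeps model transfer and compute latency):

```python
import concurrent.futures
import time

NUM_EXPERTS = 8
TOP_K = 2

def router(token, layer):
    # Toy token-level routing: pick top-k experts from a deterministic
    # hash of (token, layer, expert) -- no hidden-state dependence.
    scores = sorted(range(NUM_EXPERTS),
                    key=lambda e: hash((token, layer, e)), reverse=True)
    return scores[:TOP_K]

def load_expert(layer, expert):
    # Stand-in for a CPU -> VRAM weight transfer.
    time.sleep(0.01)
    return (layer, expert)

def run_expert(weights, hidden):
    # Stand-in for the expert's FFN compute; records which expert ran.
    time.sleep(0.01)
    return hidden + [weights]

def forward(token, num_layers=4):
    # Because routing ignores hidden states, every layer's route is
    # known before any compute starts.
    routes = [router(token, layer) for layer in range(num_layers)]
    hidden = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=TOP_K) as pool:
        # Start loading layer 0's experts immediately.
        pending = [pool.submit(load_expert, 0, e) for e in routes[0]]
        for layer in range(num_layers):
            # Prefetch the next layer's experts while this layer computes.
            nxt = ([pool.submit(load_expert, layer + 1, e)
                    for e in routes[layer + 1]]
                   if layer + 1 < num_layers else [])
            for fut in pending:
                hidden = run_expert(fut.result(), hidden)
            pending = nxt
    return hidden
```

With learned routers that read the previous layer's activations, `routes` cannot be precomputed like this and the prefetch collapses to loading one layer ahead at best.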

The chosen expert (in each layer) depends on the output of the previous layer. I'm not sure how you can preload the experts before the forward pass.