
Comment by popinman322

14 days ago

You can swap experts in and out of VRAM; it just increases inference time substantially.

Depending on the routing function, you can determine all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
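This only works when routing does not depend on intermediate hidden states, e.g. hash-based token-level routing. A minimal sketch under that assumption, where `router`, `load_expert`, and `run_expert` are all hypothetical stand-ins (the sleeps model transfer and compute latency):

```python
import concurrent.futures
import time

NUM_EXPERTS = 8
TOP_K = 2

def router(token, layer):
    # Toy token-level routing: pick top-k experts from a deterministic
    # hash of (token, layer, expert) -- no hidden-state dependence.
    scores = sorted(range(NUM_EXPERTS),
                    key=lambda e: hash((token, layer, e)), reverse=True)
    return scores[:TOP_K]

def load_expert(layer, expert):
    # Stand-in for a CPU -> VRAM weight transfer.
    time.sleep(0.01)
    return (layer, expert)

def run_expert(weights, hidden):
    # Stand-in for the expert's FFN compute; records which expert ran.
    time.sleep(0.01)
    return hidden + [weights]

def forward(token, num_layers=4):
    # Because routing ignores hidden states, every layer's route is
    # known before any compute starts.
    routes = [router(token, layer) for layer in range(num_layers)]
    hidden = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=TOP_K) as pool:
        # Start loading layer 0's experts immediately.
        pending = [pool.submit(load_expert, 0, e) for e in routes[0]]
        for layer in range(num_layers):
            # Prefetch the next layer's experts while this layer computes.
            nxt = ([pool.submit(load_expert, layer + 1, e)
                    for e in routes[layer + 1]]
                   if layer + 1 < num_layers else [])
            for fut in pending:
                hidden = run_expert(fut.result(), hidden)
            pending = nxt
    return hidden
```

With learned routers that read the previous layer's activations, `routes` cannot be precomputed like this and the prefetch collapses to loading one layer ahead at best.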

The chosen expert (in each layer) depends on the output of the previous layer. I'm not sure how you can preload the experts before the forward pass.