Comment by svnt

21 hours ago

Maybe I am misunderstanding something but:

1) This is basically the intention behind several recent MoE models: keep certain generally useful experts hot in VRAM.

2) Unless you can swap layers in faster than you consume them, there is no point in predicting layers (what does that even really mean? did you mean predicting experts?). A rough back-of-the-envelope check is sketched below.
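
For concreteness, here is the kind of arithmetic point 2 implies. Every number below is an illustrative assumption, not a measurement of any particular model or machine:

```python
# Back-of-the-envelope check: can a layer be swapped into VRAM faster
# than the forward pass consumes it? All numbers are assumed, not measured.

layer_params = 0.5e9          # assumed parameters per transformer layer
bytes_per_param = 2           # fp16
layer_bytes = layer_params * bytes_per_param   # ~1 GB per layer

pcie_bw = 32e9                # assumed ~32 GB/s effective PCIe 4.0 x16
transfer_time = layer_bytes / pcie_bw          # seconds to stage one layer

tokens_per_sec = 20           # assumed decode speed with all layers resident
num_layers = 60               # assumed model depth
compute_time_per_layer = (1 / tokens_per_sec) / num_layers

print(f"transfer: {transfer_time * 1e3:.1f} ms/layer, "
      f"compute: {compute_time_per_layer * 1e3:.2f} ms/layer")
# With these assumptions: ~31 ms to move a layer vs ~0.8 ms to compute it.
# If transfer time dwarfs compute time like this, predicting upcoming
# layers buys you nothing: the bus, not prediction accuracy, is the bottleneck.
```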

It seems that at the moment the best you can do is keep the experts and layers most likely to be used for a given query in VRAM and offload the rest, but this is workload-dependent. A minimal sketch of that kind of placement policy follows.
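
This is only a sketch of the idea, assuming PyTorch and a CUDA device; the class and method names are hypothetical and not taken from any actual MoE runtime. It keeps a fixed budget of experts resident in VRAM and evicts the least-routed-to one on a miss:

```python
# Hypothetical workload-dependent expert placement: pin the most
# frequently routed-to experts in VRAM, evict the rest to host RAM.
import torch
import torch.nn as nn
from collections import Counter

class ExpertPool:
    def __init__(self, experts: list[nn.Module], vram_budget: int):
        self.experts = experts            # all experts, initially on CPU
        self.budget = vram_budget         # max experts resident in VRAM
        self.hits = Counter()             # routing counts per expert id
        self.resident: set[int] = set()   # ids currently in VRAM

    def fetch(self, idx: int) -> nn.Module:
        self.hits[idx] += 1
        if idx not in self.resident:
            if len(self.resident) >= self.budget:
                # Evict the resident expert with the fewest routing hits.
                coldest = min(self.resident, key=lambda i: self.hits[i])
                self.experts[coldest].to("cpu")
                self.resident.discard(coldest)
            # The swap-in cost here is exactly what point 2 worries about.
            self.experts[idx].to("cuda")
            self.resident.add(idx)
        return self.experts[idx]

# Usage: route a token to expert 3, fetching it into VRAM on demand.
pool = ExpertPool([nn.Linear(1024, 1024) for _ in range(8)], vram_budget=2)
x = torch.randn(1, 1024, device="cuda")
y = pool.fetch(3)(x)
```

Even this only helps if the routing distribution is skewed and reasonably stable over the course of a query, which is exactly the workload dependence mentioned above.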