Oh, it'll never run on a 4090. 17B is the active parameter count, not the total. ("Active" doesn't mean you can slice just those params out and put them on the GPU; which parameters are active changes constantly, even per token. It just means you get tokens faster than a dense model of the same total size.) It's 109B total parameters, so even at 4-bit quantization you'd need at least ~54.5GB of VRAM for the weights alone.
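For anyone who wants to sanity-check the numbers, the back-of-the-envelope arithmetic is just bytes-per-parameter (this ignores KV cache, activations, and runtime overhead, which all add more on top):

    # Weight-only memory for a 109B-parameter model at different precisions.
    TOTAL_PARAMS = 109e9

    for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4), ("2-bit", 2)]:
        gb = TOTAL_PARAMS * bits / 8 / 1e9
        print(f"{name}: ~{gb:.1f} GB")

    # FP16: ~218.0 GB, FP8: ~109.0 GB, 4-bit: ~54.5 GB, 2-bit: ~27.2 GB
    # i.e. even 2-bit doesn't quite fit a 24GB card once you add KV cache.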
A Framework Desktop, Mac Studio, or Nvidia DGX Spark should be able to handle the Scout model locally though... Maybe even at FP8, depending on how much context you need.
Well, Scout should run on the rumored 96GB 4090, since it runs on a single 80GB H100. But yeah, it'd have to be at sub-2-bit quantization to run on a standard 24GB card.
Sounds runnable on 2x 5090s, then, presumably for ~$4k if they ever come back in stock.
True! A Framework Desktop or mid-tier Mac Studio would also work and would be cheaper, and you could run Scout at FP8. A maxed-out Mac Studio could even handle Maverick at FP8, albeit at a pretty steep price (~$10k).
It's still runnable locally. Just not on a 4090.
You can swap experts in and out of VRAM; it just increases inference time substantially.
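For anyone curious what "swapping experts" looks like in practice, here's a toy sketch assuming PyTorch, with expert weights held as pinned CPU tensors. The class and its names are made up for illustration, not any particular framework's API:

    # Keep an LRU cache of expert weights in VRAM and copy the rest in from
    # pinned CPU memory on demand.
    from collections import OrderedDict
    import torch

    class ExpertCache:
        def __init__(self, cpu_experts, max_gpu_experts):
            # cpu_experts: {expert_id: weight tensor in CPU RAM}
            self.cpu_experts = {k: v.pin_memory() for k, v in cpu_experts.items()}
            self.gpu_cache = OrderedDict()            # expert_id -> tensor on GPU
            self.max_gpu_experts = max_gpu_experts

        def get(self, expert_id):
            if expert_id in self.gpu_cache:
                self.gpu_cache.move_to_end(expert_id)  # mark as recently used
                return self.gpu_cache[expert_id]
            if len(self.gpu_cache) >= self.max_gpu_experts:
                self.gpu_cache.popitem(last=False)     # evict least recently used
            # PCIe transfer from pinned memory; this is the part that makes
            # inference substantially slower.
            w = self.cpu_experts[expert_id].to("cuda", non_blocking=True)
            self.gpu_cache[expert_id] = w
            return w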
Depending on the routing function, you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
The chosen experts (at each layer) depend on the output of the previous layer, so I'm not sure how you can preload the experts before the forward pass.
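Right, for a generic router you can't know everything up front. What some MoE-offloading implementations do instead is speculative prefetch: apply the next layer's (tiny) router to the current hidden state as a guess, start the copies on a side CUDA stream so they overlap with compute, and eat a blocking load when the guess is wrong. A rough sketch, reusing the ExpertCache toy from above (all names are illustrative):

    import torch

    copy_stream = torch.cuda.Stream()

    def prefetch_next_layer(hidden, next_router, cache, top_k=2):
        # Guess layer i+1's experts from layer i's hidden state.
        logits = next_router(hidden)                        # [tokens, n_experts]
        guessed = torch.topk(logits, top_k, dim=-1).indices.unique().tolist()
        with torch.cuda.stream(copy_stream):                # overlap with compute
            for e in guessed:
                cache.get(e)                                # async H2D copy if missing
        # Caller synchronizes on copy_stream (and falls back to a blocking
        # load) before actually running the next layer's experts.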
Unless something’s changed you will need the whole model on the HPU anyway, no? So way beyond a 4090 regardless.
You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than it would be if everything were in GPU memory.
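If you just want it to run and don't care about speed, the stock way to do this with Hugging Face transformers is to cap GPU memory and let accelerate spill the rest to system RAM. A rough sketch; the model id and memory numbers are placeholders:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"   # placeholder id
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",                         # accelerate decides placement
        max_memory={0: "22GiB", "cpu": "128GiB"},  # leave headroom on a 24GB card
        torch_dtype="auto",
    )
    tok = AutoTokenizer.from_pretrained(model_id)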
see ktransformers: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...
I'm certainly not the brightest person in this thread, but has there been any effort to bucket the model by computational cost, so that the more expensive parts sit on the GPU and the less expensive parts on the CPU?
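That's roughly what ktransformers (linked above) does: the small, always-active pieces (attention, router, shared weights) live on the GPU, while the bulky, sparsely-used expert FFNs live in CPU RAM. You can express the same placement idea by hand with a per-module device_map in transformers; a toy sketch, where the module names and layer count are assumptions you'd check against model.named_modules():

    from transformers import AutoModelForCausalLM

    NUM_LAYERS = 48          # assumption for the sketch
    device_map = {"model.embed_tokens": 0, "model.norm": 0, "lm_head": 0}
    for i in range(NUM_LAYERS):
        device_map[f"model.layers.{i}.self_attn"] = 0   # small, always used -> GPU
        device_map[f"model.layers.{i}.input_layernorm"] = 0
        device_map[f"model.layers.{i}.post_attention_layernorm"] = 0
        device_map[f"model.layers.{i}.mlp"] = "cpu"     # bulky expert block -> RAM

    model = AutoModelForCausalLM.from_pretrained(
        "some/moe-checkpoint",        # placeholder
        device_map=device_map,
    )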
A Habana, just for inference? Are you sure?
Also, I see the 4-bit quants put it at an H100, which is fine ... I've got those at work. Maybe there will be distilled versions for running at home.