
Comment by mordae

7 hours ago

Look at GB/s.

Strix Halo has 256 GB/s of memory bandwidth for $2500. The Flash model reads about 13 GB of active weights per token.

256 / 13 = 19.7 tokens per second

Except the whole model doesn't fit into the 128 GB maximum RAM that Strix Halo supports. So move on.

Another option is Threadripper. That's 8 memory channels. Using older DDR4-3200 you get roughly 200 GB/s. For $2000.

200 / 13 = 15.4 tokens per second
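The same back-of-the-envelope estimate as a small Python sketch. It assumes decoding is purely memory-bandwidth bound, i.e. every token has to stream the full 13 GB of active weights from RAM once, and it ignores compute, caches and KV cache traffic. The function name is made up for illustration.

```python
def tokens_per_second(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """Upper bound on decode speed if each token must stream
    active_gb_per_token of weights from memory (bandwidth-bound assumption)."""
    return bandwidth_gb_s / active_gb_per_token


ACTIVE_GB = 13.0  # per-token figure quoted in this comment

strix_halo_tps = tokens_per_second(256.0, ACTIVE_GB)    # 256 GB/s
threadripper_tps = tokens_per_second(200.0, ACTIVE_GB)  # ~200 GB/s, 8x DDR4-3200

print(f"Strix Halo:   {strix_halo_tps:.1f} tok/s")
print(f"Threadripper: {threadripper_tps:.1f} tok/s")
```

These are ceilings, not benchmarks. Real numbers land below them.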

But a chunk of the per-token weights is the same for every token (the shared, non-MoE part), so you would offload that to a GPU and get a decent speedup. Say 25 tokens per second total.
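A rough sketch of why the offload helps, under the same bandwidth-bound assumption. The 5 GB shared portion and the 1000 GB/s GPU bandwidth are placeholder guesses, not figures for any particular model or card, and PCIe transfers and compute time are ignored.

```python
def split_tokens_per_second(active_gb: float, shared_gb: float,
                            cpu_bw_gb_s: float, gpu_bw_gb_s: float) -> float:
    """Decode speed when shared_gb of the per-token weights are read from
    GPU memory and the rest (the routed experts) from system RAM."""
    if not 0.0 <= shared_gb <= active_gb:
        raise ValueError("shared_gb must be between 0 and active_gb")
    seconds_per_token = (shared_gb / gpu_bw_gb_s
                         + (active_gb - shared_gb) / cpu_bw_gb_s)
    return 1.0 / seconds_per_token


SHARED_GB = 5.0      # hypothetical size of the always-used weights
GPU_BW_GB_S = 1000.0  # hypothetical GPU memory bandwidth

hybrid_tps = split_tokens_per_second(13.0, SHARED_GB, 200.0, GPU_BW_GB_S)
print(f"CPU + GPU offload: {hybrid_tps:.1f} tok/s")
```

With these guesses the CPU side only has to stream 8 GB per token instead of 13, which is where most of the gain comes from. The exact figure depends entirely on how large the shared part really is.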

Then likely some expensive Mac. No idea.

Eventually you arrive at a mining rig chassis with a beefy board and multiple GPUs. That has the benefit of pipelining: each GPU runs its part of the model and hands the result to the next one, so another batch can start on the first GPU right away. Low (say 30-100) tps per request, but a lot more in parallel. Best shared with other people.