Comment by mordae
8 hours ago
Look at GB/s.
Strix Halo has 256 GB/s of memory bandwidth for $2500. The Flash model reads about 13 GB of active weights per token.
256 / 13 = 19.7 tokens per second
Except the model does not fit into the 128 GB maximum RAM that Strix Halo supports. So move on.
Another option is Threadripper. That's 8 memory channels. Using older DDR4-3200 you get roughly 200 GB/s. For $2000.
200 / 13 = 15.4 tokens per second
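The arithmetic above as a minimal Python sketch. It assumes decoding is purely memory-bandwidth bound, so it gives an upper limit rather than a measured speed. The function names are made up for illustration, and the 8 bytes per transfer assumes a standard 64-bit DDR channel.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes each channel moves bus_bytes per transfer (8 for a 64-bit channel).
    """
    return channels * mega_transfers * bus_bytes / 1000


def max_tokens_per_second(bandwidth_gbs: float, active_gb: float) -> float:
    """Upper bound on decode speed if every token reads active_gb from memory."""
    return bandwidth_gbs / active_gb


ACTIVE_GB = 13  # per-token weight traffic, figure taken from the comment

print(round(max_tokens_per_second(256, ACTIVE_GB), 1))  # Strix Halo
print(round(peak_bandwidth_gbs(8, 3200), 1))            # 8-channel DDR4-3200
print(round(max_tokens_per_second(200, ACTIVE_GB), 1))  # Threadripper, rounded bandwidth
```

Real numbers land below this bound, since compute, cache misses and KV cache reads are ignored here.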
But a chunk of the per-token weights is always the same (the non-MoE part), so you would offload that to a GPU and get a decent speedup. Say 25 tokens per second total.
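A rough way to estimate that speedup, as a sketch only. The 5 GB / 8 GB split between shared and expert weights and the 1000 GB/s GPU bandwidth are hypothetical placeholders, not figures for any real model or card. It also assumes the two reads happen one after the other for each token.

```python
def split_tokens_per_second(shared_gb: float, expert_gb: float,
                            gpu_gbs: float, cpu_gbs: float) -> float:
    """Decode rate when shared weights are read from GPU memory and
    expert weights from system RAM, sequentially for each token."""
    seconds_per_token = shared_gb / gpu_gbs + expert_gb / cpu_gbs
    return 1 / seconds_per_token


# Hypothetical split of the 13 GB: 5 GB always-active, 8 GB routed experts.
SHARED_GB = 5
EXPERT_GB = 8
GPU_GBS = 1000  # assumed GPU memory bandwidth
CPU_GBS = 200   # 8-channel DDR4-3200, rounded

print(round(split_tokens_per_second(SHARED_GB, EXPERT_GB, GPU_GBS, CPU_GBS), 1))
```

The larger the always-active share, the closer the result gets to the GPU-only rate, which is why the split matters more than the exact GPU model.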
Then likely some expensive Mac. No idea.
Eventually you arrive at a mining rig chassis with a beefy board and multiple GPUs. That has the benefit of pipelining: one GPU runs its part of the model and hands off to the next, so another batch can start on the first one. Low (say 30-100) tps per stream, but a lot more in aggregate. Best bought together with other people.
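Why pipelining gives low per-stream but high aggregate speed, as a toy model. The four stages at 5 ms each are hypothetical, and transfer time between GPUs is ignored.

```python
def per_stream_tps(stage_seconds: list[float]) -> float:
    """One stream must pass through every stage for each token."""
    return 1 / sum(stage_seconds)


def aggregate_tps(stage_seconds: list[float], in_flight: int) -> float:
    """Total tokens/s with in_flight independent streams in the pipeline.

    Scales with the number of streams until the slowest stage saturates.
    """
    return min(in_flight / sum(stage_seconds), 1 / max(stage_seconds))


stages = [0.005] * 4  # hypothetical: 4 GPUs, 5 ms per token each

print(round(per_stream_tps(stages)))      # a single user
print(round(aggregate_tps(stages, 4)))    # pipeline kept full
print(round(aggregate_tps(stages, 8)))    # extra streams only queue up
```

Once there is one stream per stage, adding more does not help, so the rig pays off only if there are enough concurrent users to keep every GPU busy.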