Comment by Tepix
1 month ago
You could buy five Strix Halo systems at $2000 each, network them and run it.
Rough estimate: 12.5:2.2, so you should get around 5.5 tokens/s.
Is the software/drivers for networking LLMs on Strix Halo there yet? I was under the impression a few weeks ago that it's veeeery early stages and terribly slow.
Check out https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/...
llama.cpp with its rpc-server doesn't require much bandwidth during inference, though there is some loss of performance.
For example, using two Strix Halo machines you can get 17 or so tokens/s with MiniMax M2.1 Q6. That's a 229B-parameter model with a 10B active set (7.5GB at Q6). The theoretical maximum speed with 256GB/s of memory bandwidth would be about 34 tokens/s.
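The bound above comes from the usual back-of-envelope for decode speed: each generated token has to stream the active weights through memory once, so bandwidth divided by active-weight size caps tokens/s. A quick sketch using the figures from this comment (the numbers are the comment's rough estimates, not measured specs):

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / active weight size.
# Figures are the rough numbers quoted above, not authoritative hardware specs.

bandwidth_gb_s = 256.0     # Strix Halo memory bandwidth, GB/s (approximate)
active_weights_gb = 7.5    # MiniMax M2.1 active parameters at Q6, GB

theoretical_tps = bandwidth_gb_s / active_weights_gb
print(round(theoretical_tps, 1))  # -> 34.1
```

The measured 17 tokens/s on two networked boxes is about half this ceiling, which is the performance loss the rpc-server setup trades for fitting the model at all.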