Comment by Tepix
1 month ago
You could buy five Strix Halo systems at $2000 each, network them and run it.
Rough estimate: 12.5:2.2, so you should get around 5.5 tokens/s.
Is the software/drivers for networking LLMs on Strix Halo there yet? I was under the impression a few weeks ago that it's veeeery early stages and terribly slow.
Check out https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/...
llama.cpp with its rpc-server doesn't require much bandwidth during inference, though there is some loss of performance.
For example, using two Strix Halo machines you can get 17 or so tokens/s with MiniMax M2.1 Q6. That's a 229B-parameter model with a 10B active set (7.5GB at Q6). The theoretical maximum speed with 256GB/s of memory bandwidth would be about 34 tokens/s.
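The bound above comes from the usual back-of-envelope for decode speed: each generated token has to stream the active weights through memory once, so bandwidth divided by active-weight size caps tokens/s. A quick sketch using the figures from this comment (the numbers are the comment's rough estimates, not measured specs):

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / active weight size.
# Figures are the rough numbers quoted above, not authoritative hardware specs.

bandwidth_gb_s = 256.0     # Strix Halo memory bandwidth, GB/s (approximate)
active_weights_gb = 7.5    # MiniMax M2.1 active parameters at Q6, GB

theoretical_tps = bandwidth_gb_s / active_weights_gb
print(round(theoretical_tps, 1))  # -> 34.1
```

The measured 17 tokens/s on two networked boxes is about half this ceiling, which is the performance loss the rpc-server setup trades for fitting the model at all.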