Comment by sofixa
1 day ago
You can run vLLM with AMD GPUs supported by ROCm: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/infer...
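The basic Python API is the same on a ROCm build as on CUDA; a rough sketch (the model name is just an example, not something from the docs above):

    # Minimal vLLM offline-inference sketch; runs the same against a ROCm build
    # as against CUDA. The model name is only an example.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # any HF model vLLM supports
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain ROCm in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)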
However, from experience with an AMD Strix Halo, there are a couple of caveats: it was drastically slower than Ollama (tested over a few weeks, always using the official AMD vLLM nightly releases), and not all GPUs were supported for all models (though that has since been fixed).
vLLM usually only plays to its strengths when serving multiple users in parallel, in contrast to llama.cpp (Ollama is a wrapper around llama.cpp).
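A quick way to see that difference is to point concurrent requests at the OpenAI-compatible endpoint both servers expose and compare aggregate tokens/s. Rough sketch; the port and model name are assumptions for a local setup (vLLM defaults to :8000, Ollama's OpenAI-compatible API lives at :11434/v1):

    # Rough concurrency probe, single user vs. 8 parallel users.
    # base_url and model are assumptions for a local setup.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

    def one_request(_):
        r = client.chat.completions.create(
            model="Qwen/Qwen2.5-7B-Instruct",
            messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
            max_tokens=128,
        )
        return r.usage.completion_tokens

    for n in (1, 8):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            toks = sum(pool.map(one_request, range(n)))
        print(f"{n} parallel: {toks / (time.time() - start):.1f} tok/s aggregate")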
If you want more performance, you could try running llama.cpp directly or using the prebuilt lemonade nightlies; a sketch of the first option follows below.
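If you drive llama.cpp from Python rather than through Ollama, the llama-cpp-python binding looks roughly like this (the GGUF path and full GPU offload are assumptions for your setup):

    # Sketch of using llama.cpp directly via the llama-cpp-python binding;
    # the GGUF path and n_gpu_layers=-1 (offload everything) are assumptions.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/qwen2.5-7b-instruct-q4_k_m.gguf",
                n_gpu_layers=-1,   # offload all layers to the GPU
                n_ctx=4096)

    out = llm("Explain ROCm in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])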
But vLLM was at half the t/s of Ollama, so something was obviously not right.