ZLUDA implements CUDA on top of AMD ROCm - they are explicitly targeting vLLM as their PyTorch compatibility test: https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/#pyt...
(PyTorch also supports ROCm directly; a ROCm GPU shows up as a CUDA device.)
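A quick way to see this from Python, if it helps (minimal sketch; torch.version.hip is only set on ROCm builds):

    import torch

    # On a ROCm build of PyTorch, AMD GPUs are exposed through the regular
    # CUDA device API; torch.version.hip is set instead of torch.version.cuda.
    if torch.cuda.is_available():
        backend = "ROCm/HIP" if torch.version.hip else "CUDA"
        print(f"{backend} device: {torch.cuda.get_device_name(0)}")
        x = torch.randn(1024, 1024, device="cuda")  # "cuda" works on ROCm too
        print((x @ x).sum().item())
    else:
        print("No GPU visible to PyTorch")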
I feel like these technologies are named by Polish speakers at these companies. "CUDA" means "WONDERS" and "ZŁUDA" would be an "ILLUSION".
ZLUDA was definitely intentional: https://github.com/vosen/ZLUDA/discussions/192
You can run vLLM with AMD GPUs supported by ROCm: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/infer...
However, from experience with an AMD Strix Halo, a couple of caveats: it is drastically slower than Ollama (tested over a few weeks, always using the official AMD vLLM nightly releases), and not all GPUs were supported for all models (though that has since been fixed).
vLLM usually only shows its strength when serving multiple users in parallel, in contrast to llama.cpp (Ollama is a wrapper around llama.cpp).
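To illustrate, here is a rough sketch using vLLM's offline batched API (the model name is just a placeholder, not a recommendation): it hands vLLM many prompts at once, which is the scenario where continuous batching pays off over single-stream serving.

    from vllm import LLM, SamplingParams

    # vLLM's continuous batching pays off when many requests run concurrently,
    # so single-user t/s comparisons against Ollama/llama.cpp tend to undersell it.
    prompts = [f"Write a one-line summary of topic {i}." for i in range(32)]
    sampling = SamplingParams(temperature=0.7, max_tokens=64)

    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model name
    outputs = llm.generate(prompts, sampling)

    for out in outputs:
        print(out.outputs[0].text.strip())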
If you want more performance, you could try running llama.cpp directly or use the prebuilt lemonade nightlies.
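If you'd rather script llama.cpp from Python than run the llama-server/llama-cli binaries directly, the llama-cpp-python bindings are one option; a minimal sketch (model path is a placeholder, and GPU offload needs a build of the bindings with HIP/ROCm support):

    from llama_cpp import Llama

    # Placeholder GGUF path; n_gpu_layers=-1 offloads all layers to the GPU,
    # which requires llama.cpp compiled with GPU (e.g. HIP/ROCm) support.
    llm = Llama(model_path="./models/model.gguf", n_gpu_layers=-1, n_ctx=4096)

    result = llm("Explain ROCm in one sentence.", max_tokens=64)
    print(result["choices"][0]["text"].strip())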
But vLLM was half the t/s of Ollama, so something was obviously not ok.