Comment by onli
5 hours ago
Fact of the matter is that I have a Radeon RX 6600, which I can't use with ollama. First, there is no ROCm at all in my distro's repository - it doesn't compile reliably and needs too many resources to build. Then, when I compiled it manually, it turned out that ROCm doesn't even support the card in the first place.
I'm aware that 8GB of VRAM is not enough for most such workloads. But no support at all? That's ridiculous. Let me use the card and fall back to system memory for all I care.
Nvidia, as much as I hate their usually woefully insufficient Linux support, has no such restriction on any of their modern cards, as far as I'm aware.
My recent experience has been that the Vulkan support in llama.cpp is pretty good. It may lag behind CUDA / Metal for bleeding-edge models if they need a new operator.
Try it out! Benchmarks here: https://github.com/ggml-org/llama.cpp/discussions/10879
(ollama doesn't support Vulkan for some weird reason. I guess they never pulled the code from llama.cpp.)
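If you'd rather poke at the Vulkan path from Python than build the llama.cpp CLI yourself, here's a rough sketch using the llama-cpp-python bindings. The CMake flag name and the model path are my assumptions (the flag has changed across releases, and your GGUF file will live wherever you put it):

```python
# Sketch: llama.cpp's Vulkan backend via the llama-cpp-python bindings.
# Install with the Vulkan backend enabled (flag name may differ by version):
#   CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
from llama_cpp import Llama

# "model.gguf" is a placeholder; point it at any GGUF you have locally.
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU (Vulkan in this build)
    n_ctx=2048,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Same model, same machine, you can rebuild with the CUDA or ROCm flags instead and compare tokens/sec directly, which is basically what the linked benchmark thread does.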
Thanks, I might indeed give this a test!
You should be able to use ollama's Vulkan backend, and in my experience the speed will be the same as with ROCm. (I just spent a bunch of time putting Linux on my 2025 ASUS ROG Flow Z13 to use ROCm, only to see the exact same performance as Vulkan.)
That would mean switching to https://github.com/whyvl/ollama-vulkan? I see no backend selection in ollama, nor anything about it in the FAQ.
> I'm aware that 8GB of VRAM is not enough for most such workloads. But no support at all? That's ridiculous. Let me use the card and fall back to system memory for all I care.
> Nvidia, as much as I hate their usually woefully insufficient Linux support, has no such restriction on any of their modern cards, as far as I'm aware.
In fact, I regularly run llamafile (and sometimes ollama) on an Nvidia dGPU in a laptop with 4GB of VRAM, and it works fine(ish) - I mostly do the partial-offload thing where some layers are on the GPU and the rest are on the CPU; it's still faster than pure CPU, so whatever.
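For anyone curious what that partial offload looks like, here's a minimal sketch with the llama-cpp-python bindings rather than llamafile itself (llamafile exposes the same knob through its -ngl flag, if I remember right). The layer count and model path are made up; you'd tune the count to whatever actually fits in your 4GB:

```python
from llama_cpp import Llama

# Partial offload: put as many layers on the GPU as the VRAM allows,
# and let the remaining layers run on the CPU. 20 is an arbitrary
# illustration value, not a recommendation.
llm = Llama(
    model_path="model.gguf",   # placeholder path
    n_gpu_layers=20,           # layers on the GPU; the rest stay on the CPU
    n_ctx=2048,
)

print(llm("Write a haiku about VRAM.", max_tokens=48)["choices"][0]["text"])
```

If you see out-of-memory errors at load time, lower the layer count; if the GPU still has headroom, raise it until it doesn't.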