Comment by zozbot234
6 months ago
This is not quite correct. Ollama must assess the state of Vulkan support and the amount of available memory, then pick what fraction of the model to host on the GPU. That process is not totally foolproof and will likely always need manual adjustment in some cases.
> The work involved is tiny compared to the work llama.cpp did to get Vulkan up and running. This is not rocket science.
This sounds like it should be trivial to reproduce and extend - I look forward to trying out your repo!
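To make the point concrete, here is a minimal sketch of the kind of sizing heuristic involved, in Go since that is what Ollama is written in. Every name here is hypothetical; this is not Ollama's actual code, just the shape of the decision: ask the backend how much VRAM it thinks is free, subtract fixed overhead, and see how many layers fit.

```go
// Hypothetical sketch of the sizing decision described above: given the
// VRAM a backend (Vulkan or otherwise) reports as free, estimate how many
// of the model's layers fit on the GPU; the rest stay on the CPU.
// None of these names are Ollama's real API.
package main

import "fmt"

type ModelInfo struct {
	LayerCount    int    // total transformer layers in the model
	BytesPerLayer uint64 // rough weight size of one layer at the chosen quantization
	KVCacheBytes  uint64 // KV cache for the requested context length
	GraphBytes    uint64 // scratch/compute buffers the backend needs
}

// layersToOffload picks how many layers to place on the GPU, keeping a
// safety margin because the "free VRAM" a driver reports is never exact.
// Returns 0 if even the fixed overhead doesn't fit (pure CPU run).
func layersToOffload(m ModelInfo, freeVRAM uint64) int {
	const headroom = 512 << 20 // leave ~512 MiB unused as a buffer

	overhead := m.KVCacheBytes + m.GraphBytes + headroom
	if freeVRAM <= overhead {
		return 0
	}
	n := int((freeVRAM - overhead) / m.BytesPerLayer)
	if n > m.LayerCount {
		n = m.LayerCount // the whole model fits
	}
	return n
}

func main() {
	m := ModelInfo{
		LayerCount:    32,
		BytesPerLayer: 220 << 20, // ~220 MiB per layer, e.g. a 7B model at 4-bit
		KVCacheBytes:  1 << 30,   // 1 GiB KV cache
		GraphBytes:    400 << 20,
	}
	fmt.Println("offload layers:", layersToOffload(m, 8<<30)) // 8 GiB card
}
```

Every number in that estimate is a guess (drivers under-report free memory, the KV cache scales with the requested context), which is exactly why some setups will always need a manual override.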
The owner of that PR has already forked Ollama. Try it out; I did, and it works great.