Comment by zozbot234

6 months ago

This is not quite correct. Ollama must assess the state of Vulkan support and the amount of available memory, then pick the fraction of the model to be hosted on the GPU. This is not totally foolproof and will likely always need manual adjustment in some cases.
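
Roughly, this amounts to dividing the free GPU memory (minus a reserve for compute and KV buffers) by the per-layer size and offloading that many layers. A minimal sketch of that idea, assuming hypothetical figures and a made-up helper name (`estimateGPULayers`); real Ollama/llama.cpp logic queries the backend for free memory and reads layer sizes from the model metadata:

```go
package main

import "fmt"

// estimateGPULayers returns how many of totalLayers fit in freeVRAM,
// keeping overheadBytes in reserve for compute/KV buffers.
// All figures here are illustrative assumptions, not Ollama's actual values.
func estimateGPULayers(freeVRAM, layerBytes, overheadBytes uint64, totalLayers int) int {
	if freeVRAM <= overheadBytes || layerBytes == 0 {
		return 0 // no usable GPU memory: run fully on CPU
	}
	fit := int((freeVRAM - overheadBytes) / layerBytes)
	if fit > totalLayers {
		fit = totalLayers // whole model fits on the GPU
	}
	return fit
}

func main() {
	// Example: 8 GiB free VRAM, ~200 MiB per layer, 1 GiB reserved, 48 layers.
	n := estimateGPULayers(8<<30, 200<<20, 1<<30, 48)
	fmt.Printf("offload %d layers to GPU, rest stay on CPU\n", n)
}
```

The failure mode the comment alludes to is that the reserve and per-layer estimates are heuristics, so users sometimes still have to override the offload count by hand.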

The work involved is tiny compared to the work llama.cpp did to get Vulkan up and running.

This is not rocket science.