Comment by magic_hamster
9 hours ago
Ollama is the worst engine you could use for this. Since you're already running on an Nvidia stack for the dense model, you should serve this with vLLM. With 128 GB you could even try the original safetensors weights, though you'd need to be careful with KV-cache size and context length.
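A minimal sketch of what that looks like with vLLM's offline `LLM` API; the model path is hypothetical, and the `max_model_len` / `gpu_memory_utilization` values are assumptions you'd tune to keep the KV cache inside your 128 GB:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/original-safetensors",  # hypothetical: the unquantized checkpoint dir
    max_model_len=8192,                    # cap context so the KV cache fits in memory
    gpu_memory_utilization=0.90,           # leave some headroom for activations
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, world"], params)
print(outputs[0].outputs[0].text)
```

The same knobs exist on the `vllm serve` CLI if you'd rather expose an HTTP endpoint.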
Strangely, I haven't had much luck with vLLM; I finally ended up ditching Ollama and going straight to the tap with llama-server in llama.cpp. No regrets.
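For anyone following that route: llama-server exposes an OpenAI-compatible endpoint, so a client is a few lines. A sketch, assuming a server already running on localhost:8080 (host, port, and model label are assumptions):

```python
from openai import OpenAI

# llama-server doesn't check the API key by default, but the client requires one
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local",  # llama-server answers with whatever model it was launched with
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```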