Comment by steren
7 days ago
> I would never want to use something like ollama in a production setting.
We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama comes out on top. We hope to be able to publish these results soon.
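For context, here is a rough sketch of how a tokens-per-second measurement like that can be taken, assuming both servers expose an OpenAI-compatible endpoint (vLLM does by default; Ollama offers one under `/v1`). The base URL, model name, and prompt are placeholders, not the actual benchmark setup described above:

```python
# Minimal sketch: time one streamed completion against an OpenAI-compatible
# endpoint and report an approximate tokens/second figure.
import time
from openai import OpenAI

# Placeholder endpoint: vLLM typically listens on :8000, Ollama on :11434/v1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="llama3",  # placeholder model name
    messages=[{"role": "user", "content": "Explain paged attention briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed chunk
elapsed = time.perf_counter() - start
print(f"{chunks} chunks in {elapsed:.2f}s -> {chunks / elapsed:.1f} tok/s (approx)")
```

Startup time would be measured separately (e.g. wall clock from process launch until the endpoint answers its first request).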
You need to benchmark against llama.cpp as well.
Did you test multi-user cases?
Assuming this is equivalent to parallel sessions, I would hope so; that is basically the entire point of vLLM.
vLLM and Ollama assume different settings and hardware. vLLM, backed by PagedAttention, expects a lot of requests from multiple users, whereas Ollama is usually for a single user on a local machine.
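A multi-user test of the kind asked about above could be approximated by firing concurrent requests at the same OpenAI-compatible endpoint and looking at aggregate throughput. This is only an illustrative sketch; the endpoint, model name, prompt, and request count are assumptions, not anyone's published methodology:

```python
# Minimal sketch: issue N requests in parallel and report aggregate tokens/second.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(i: int) -> int:
    # Each call stands in for one concurrent "user"; model name is a placeholder.
    resp = await client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": f"User {i}: summarize paged attention."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main(n: int = 16) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    print(f"{n} parallel requests, {sum(counts)} completion tokens in {elapsed:.1f}s "
          f"-> {sum(counts) / elapsed:.1f} tok/s aggregate")

asyncio.run(main())
```

A server built around continuous batching and PagedAttention (vLLM) should scale much better as N grows than a single-user-oriented local runner.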