Comment by mycall
3 hours ago
Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.
I'm only interested in the local, single-user use case. Plus, I use a Mac Studio for inference, so vLLM is not an option for me.
Also, more optimization effort has gone into vLLM than the llama.cpp people have been able to put in, which shows even when you run just one inference call at a time. Its best features are obviously the concurrency and the shared cache, though. On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.
Both have their places and are complementary, rather than competitors :)
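To make the concurrency point concrete, here's a minimal sketch of what that multi-agent use case looks like from the client side, assuming a local vLLM server exposing its OpenAI-compatible API on localhost:8000 (the URL, model name, and prompts are placeholders, not anything from the thread):

```python
# Sketch: several "agent" requests issued concurrently against a local vLLM
# server. vLLM's continuous batching serves them in parallel on the GPU,
# which is where the throughput gain over serial single-call usage comes from.
import asyncio
from openai import AsyncOpenAI

# Assumed local endpoint; vLLM's OpenAI-compatible server listens under /v1.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = ["Summarize task A", "Plan task B", "Review task C"]
    # Fire all requests at once; the server batches them rather than queuing.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer[:60])

asyncio.run(main())
```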