
Comment by embedding-shape

3 hours ago

Also, they've simply spent more time optimizing vLLM than the llama.cpp folks have, even when you run just one inference call at a time. The best feature is obviously the concurrency and shared cache, though. On the other hand, new architectures usually land in llama.cpp sooner than in vLLM.
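To make the concurrency point concrete, here's a rough sketch of what "many in-flight requests" looks like from the client side. It assumes a vLLM server already running locally with its OpenAI-compatible API at http://localhost:8000/v1; the endpoint, model name, and prompts are all placeholders, not anything specific from this thread.

```python
# Sketch: firing several chat requests concurrently at a vLLM
# OpenAI-compatible server. Assumes the server was started with
# something like `vllm serve <model>` and is reachable at the
# base_url below; the model name and prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
    api_key="not-needed-locally",         # vLLM doesn't require a key unless configured to
)

SHARED_PREFIX = "You are a terse assistant. Answer in one sentence."

async def ask(question: str) -> str:
    # Each request shares the same system prompt, so the server's
    # prefix/shared cache can reuse that portion across requests.
    resp = await client.chat.completions.create(
        model="placeholder-model",  # whatever model the server is serving
        messages=[
            {"role": "system", "content": SHARED_PREFIX},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

async def main() -> None:
    questions = [
        "What is continuous batching?",
        "What does prefix caching reuse?",
        "Why batch requests at all?",
    ]
    # The server batches these in-flight requests together on the GPU,
    # which is where the throughput win over one-call-at-a-time setups comes from.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for q, a in zip(questions, answers):
        print(f"{q}\n  -> {a}\n")

if __name__ == "__main__":
    asyncio.run(main())
```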

Both have their places and are complementary, rather than competitors :)