Comment by mycall
3 hours ago
Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.
I'm only interested in the local, single-user use case. Plus, I use a Mac Studio for inference, so vLLM is not an option for me.
Also, more optimization effort has gone into vLLM than the llama.cpp people have been able to put in, which shows even when you run just one inference call at a time. Its best features are obviously the concurrency and the shared cache, though. On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.
Both have their places and are complementary, rather than competitors :)
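To make the concurrency point concrete, here's a minimal sketch of what that multi-agent use case looks like from the client side, assuming a local vLLM server exposing its OpenAI-compatible API on localhost:8000 (the URL, model name, and prompts are placeholders, not anything from the thread):

```python
# Sketch: several "agent" requests issued concurrently against a local vLLM
# server. vLLM's continuous batching serves them in parallel on the GPU,
# which is where the throughput gain over serial single-call usage comes from.
import asyncio
from openai import AsyncOpenAI

# Assumed local endpoint; vLLM's OpenAI-compatible server listens under /v1.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = ["Summarize task A", "Plan task B", "Review task C"]
    # Fire all requests at once; the server batches them rather than queuing.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer[:60])

asyncio.run(main())
```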