
Comment by tarruda

6 hours ago

These days I don't feel the need to use anything other than the llama.cpp server, as it has a pretty good web UI and a router mode for switching models.
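
For anyone curious what "just using llama.cpp server" looks like in practice, here's a minimal sketch of hitting its OpenAI-compatible endpoint from Python. It assumes llama-server is already running locally on its default port 8080 with a model loaded; the URL and model name are placeholders for your setup.

```python
# Minimal sketch: talk to a locally running llama-server through its
# OpenAI-compatible API. Assumes the server was started on the default
# port 8080 (e.g. `llama-server -m some-model.gguf`); adjust the URL and
# model name for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="not-needed",                 # local server, no real key required
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; a single-model server just uses whatever is loaded
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
)
print(response.choices[0].message.content)
```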

I mostly use LM Studio for browsing and downloading models and testing them out quickly, but actual integration is always with either llama.cpp or vLLM. Curious to try out their new CLI, though, and see whether it adds any extra benefits on top of llama.cpp.

Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.
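
To make the concurrency point concrete, here's a rough sketch of several "agents" firing requests at the same OpenAI-compatible server at once, so a batching engine can serve them together instead of one after another. It assumes something like vLLM started with `vllm serve <model>` on its default port 8000; the URL, model name, and agent tasks are just placeholders.

```python
# Minimal sketch of why concurrency matters for multi-agent setups:
# send several requests in parallel and let the server batch them.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def run_agent(task: str) -> str:
    resp = await client.chat.completions.create(
        model="placeholder-model",  # placeholder model name
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

async def main():
    tasks = [
        "Plan the next refactoring step.",
        "Write unit tests for the parser.",
        "Review the open PR for style issues.",
    ]
    # All requests are in flight at once; a continuous-batching server
    # like vLLM processes them together rather than sequentially.
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    for task, result in zip(tasks, results):
        print(f"--- {task}\n{result}\n")

asyncio.run(main())
```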

  • I'm only interested in the local, single-user use case. Plus I use a Mac Studio for inference, so vLLM is not an option for me.

  • Also, the vLLM team has simply spent more time on optimization than the llama.cpp people have, even for a single inference call at a time. The best features are obviously the concurrency and the shared cache, though (see the sketch after this list). On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.

    Both have their places and are complementary, rather than competitors :)
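
On the shared-cache point: here's a rough sketch of how multiple agents sharing the same long system prompt lets a server with prefix caching (e.g. vLLM's automatic prefix caching; llama.cpp has its own prompt/KV reuse) skip re-processing the common prefix on every request. The server URL, model name, and prompt content are illustrative only.

```python
# Minimal sketch: give every agent an identical system prompt so the
# server can reuse the cached prefix instead of recomputing it per call.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SHARED_SYSTEM_PROMPT = (
    "You are one of several agents working on the same codebase. "
    "Follow the project conventions: ..."  # imagine a long, identical prefix here
)

def ask(agent_task: str) -> str:
    resp = client.chat.completions.create(
        model="placeholder-model",  # placeholder model name
        messages=[
            # Identical first message across agents -> identical token prefix,
            # which is exactly what prefix caching can reuse.
            {"role": "system", "content": SHARED_SYSTEM_PROMPT},
            {"role": "user", "content": agent_task},
        ],
    )
    return resp.choices[0].message.content

print(ask("Draft a migration plan for the database layer."))
print(ask("List risky functions that need more tests."))
```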