
Comment by tarruda

6 hours ago

These days I don't feel the need to use anything other than the llama.cpp server, as it has a pretty good web UI and a router mode for switching models.
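
For anyone curious what "just using llama.cpp server" looks like in practice, here's a minimal sketch of hitting its OpenAI-compatible endpoint from Python. It assumes llama-server is already running locally on its default port 8080 with a model loaded; the URL and model name are placeholders for your setup.

```python
# Minimal sketch: talk to a locally running llama-server through its
# OpenAI-compatible API. Assumes the server was started on the default
# port 8080 (e.g. `llama-server -m some-model.gguf`); adjust the URL and
# model name for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="not-needed",                 # local server, no real key required
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; a single-model server just uses whatever is loaded
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
)
print(response.choices[0].message.content)
```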

I mostly use LM Studio for browsing and downloading models and testing them out quickly, but actual integration is always with either llama.cpp or vLLM. Curious to try out their new CLI, though, and see whether it adds any extra benefits on top of llama.cpp.

Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.
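
To make the concurrency point concrete, here's a rough sketch of several "agents" firing requests at the same OpenAI-compatible server at once, so a batching engine can serve them together instead of one after another. It assumes something like vLLM started with `vllm serve <model>` on its default port 8000; the URL, model name, and agent tasks are just placeholders.

```python
# Minimal sketch of why concurrency matters for multi-agent setups:
# send several requests in parallel and let the server batch them.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def run_agent(task: str) -> str:
    resp = await client.chat.completions.create(
        model="placeholder-model",  # placeholder model name
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

async def main():
    tasks = [
        "Plan the next refactoring step.",
        "Write unit tests for the parser.",
        "Review the open PR for style issues.",
    ]
    # All requests are in flight at once; a continuous-batching server
    # like vLLM processes them together rather than sequentially.
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    for task, result in zip(tasks, results):
        print(f"--- {task}\n{result}\n")

asyncio.run(main())
```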

  • I'm only interested in the local, single-user use case. Plus I use a Mac Studio for inference, so vLLM is not an option for me.

  • Also, the vLLM team has simply spent more time on optimization than the llama.cpp people have, even for a single inference call at a time. The best features are obviously the concurrency and the shared cache, though (see the sketch after this list). On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.

    Both have their places and are complementary, rather than competitors :)
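
On the shared-cache point: here's a rough sketch of how multiple agents sharing the same long system prompt lets a server with prefix caching (e.g. vLLM's automatic prefix caching; llama.cpp has its own prompt/KV reuse) skip re-processing the common prefix on every request. The server URL, model name, and prompt content are illustrative only.

```python
# Minimal sketch: give every agent an identical system prompt so the
# server can reuse the cached prefix instead of recomputing it per call.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SHARED_SYSTEM_PROMPT = (
    "You are one of several agents working on the same codebase. "
    "Follow the project conventions: ..."  # imagine a long, identical prefix here
)

def ask(agent_task: str) -> str:
    resp = client.chat.completions.create(
        model="placeholder-model",  # placeholder model name
        messages=[
            # Identical first message across agents -> identical token prefix,
            # which is exactly what prefix caching can reuse.
            {"role": "system", "content": SHARED_SYSTEM_PROMPT},
            {"role": "user", "content": agent_task},
        ],
    )
    return resp.choices[0].message.content

print(ask("Draft a migration plan for the database layer."))
print(ask("List risky functions that need more tests."))
```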