Comment by proxysna

8 hours ago

Qwen3.5-27B with a 4bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and some additional vllm flags, i am serving 10 developers at 20-25tok/sek, off-peak is around 40tok/sek. Developers are ok with that performance, but ofc they requested more GPU's for added throughput.

6 comments

proxysna

tandr 7 hours ago

What would be these additional vllm flags, if you don't mind sharing?

proxysna 2 hours ago

This is from an example from my Nomad cluster with two a5000's, which is a bit different what i have at work, but it will mostly apply to most modern 24G vram nvidia gpu.
"--tensor-parallel-size", "2" - spread the LLM weights over 2 GPU's available
"--max-model-len", "90000" - I've capped context window from ~256k to 90k. It allows us to have more concurrency and for our use cases it is enough.
"--kv-cache-dtype", "fp8_e4m3", - On an L4 cuts KV cache size in half without a noticeable drop in quality, does not work on a5000, as it has no support for native FP8. Use "auto" to see what works for your gpu or try "tq3" once vllm people merge into the nightly.
"--enable-prefix-caching" - Improves time to first output.
"--speculative-config", "{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}", - Speculative mutli-token prediction. Qwen3.5 specific feature. In some cases provides a speedup of up to 40%.
"--language-model-only" - does not load vision encoder. Since we are using just the LLM part of the model. Frees up some VRAM.

PcChip 6 hours ago

question: why not use something like Claude? is it for security reasons?

lambda 5 hours ago
Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc.
I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?
- cyanydeez 5 hours ago
  
  I think, we're seeing very clearly, the problem with the Cloud (as usual) is it locks you into a service that only functions when the Cloud provides it.
  But further, seeing with Claude, your workflow, or backend or both, arn't going anywhere if you're building on local models. They don't suddenly become dumb; stop responding, claim censorship, etc. Things are non-determinant enough that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.
  So yeah, privacy, but also, knowing you don't have to constantly upgrade to another model forced by a provider when whatever you're doing is perfectly suitable, that's untolds amount of value. Imagine the early npm ecosystem, but driven now by AI model FOMO.
proxysna 2 hours ago

We do make Claude and Mistral available to our developers too. But, like you said, security. I, personally, do not understand how people in tech, put any amount of trust in businesses that are working in such a cutthroat and corrupt environment. But developers want to try new things and it is better to set up reasonable guardrails for when they want to use these thing by setting up a internal gateway and a set of reasonable policies.
And the other thing is that i want people to be able to experiment and get familiar with LLM's without being concerned about security, price or any other factor.