Comment by proxysna
6 hours ago
Qwen3.5-27B with a 4-bit quant can be run on a 24 GB card with no problem. With two Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/s; off-peak is around 40 tok/s. Developers are OK with that performance, but of course they requested more GPUs for added throughput.
What would these additional vLLM flags be, if you don't mind sharing?
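The exact flags aren't shared in the thread, but a typical two-GPU tensor-parallel vLLM setup looks something like the sketch below. The model id, context length, and other values here are assumptions for illustration, not the commenter's actual configuration.

    # A plausible two-L4 vLLM configuration; NOT the commenter's actual flags.
    # The model id is a hypothetical placeholder for a 4-bit quantized checkpoint.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3.5-27B-AWQ",   # placeholder id; substitute the real quantized checkpoint
        tensor_parallel_size=2,         # shard the model across both L4 cards
        gpu_memory_utilization=0.90,    # fraction of each 24 GB card vLLM may claim
        max_model_len=16384,            # assumed context cap so the KV cache fits beside the weights
        enable_prefix_caching=True,     # reuse KV cache for shared prompt prefixes across requests
    )

    # Roughly equivalent server invocation (OpenAI-compatible endpoint for a team):
    #   vllm serve Qwen/Qwen3.5-27B-AWQ --tensor-parallel-size 2 \
    #     --gpu-memory-utilization 0.90 --max-model-len 16384 --enable-prefix-caching

    outputs = llm.generate(
        ["Explain tensor parallelism in one sentence."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)

Tensor parallelism is the main lever here: it splits each layer across both cards so a model too large for one 24 GB GPU still fits, at the cost of inter-GPU communication per token.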
Question: why not use something like Claude? Is it for security reasons?
Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, or tweaks harnesses and prompts in ways that cause the model to consume too many tokens, or too few to complete the task, etc.
I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user-hostile ways?
I think we're seeing very clearly the problem with the Cloud (as usual): it locks you into a service that only functions as long as the provider keeps offering it.
But further, as we're seeing with Claude: if you're building on local models, your workflow, your backend, or both aren't going anywhere. They don't suddenly become dumb, stop responding, claim censorship, etc. Things are non-deterministic enough already that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.
So yeah, privacy, but also knowing a provider can't force you to upgrade to another model when whatever you're running is perfectly suitable; that's an untold amount of value. Imagine the early npm ecosystem, but now driven by AI model FOMO.