Comment by rsolva
3 hours ago
In our company of 24 employees, we get by with two DGX Sparks. We don't use AI heavily, but each Spark can serve about 6-8 concurrent requests with a full context length of 256k, which is decent. We get ~35 t/s depending on the model (currently Qwen3.5 122B A10B and Qwen3 Coder Next), but we might set up a smaller model too for simpler tasks.
This works for us and will work for years to come. It is not SOTA, but it works darn well for our purposes, and we control the compute and data flowing through it, so totally worth it.
That's pretty nice actually. How much KV cache does that model require at full context? That tends to be the main limit on running concurrent requests locally; there's KV quantization, but it has an outsized negative impact on model quality.
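For anyone who wants to run the numbers themselves, here's a rough back-of-envelope sketch; the architecture numbers are hypothetical placeholders, not the actual config of the models mentioned above:

```python
# Rough per-request KV cache size for a transformer with grouped-query
# attention (GQA). Every number below is a hypothetical placeholder --
# substitute the real values from the model's config.json.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; bf16/fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical mid-size GQA config: 48 layers, 4 KV heads, head_dim 128
per_request = kv_cache_bytes(48, 4, 128, 262_144)
print(f"{per_request / 2**30:.1f} GiB per request at 256k context")
# -> 24.0 GiB, so 6-8 requests all at *full* context wouldn't fit in a
#    Spark's 128 GB of unified memory; in practice most requests use far
#    less context, and paged attention only allocates what's actually used.
```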