← Back to context

Comment by SwellJoe

10 hours ago

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to insure there's always at least a few options even at the very frontier of capability).

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully has been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.

We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.

DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.

  • Any chance you'd be willing to talk further about your setup? I have 2 x 3090s in a local machine, and I'm still left with questions about how best to use stuff locally.

    • You can only run heavily quantized models on all 3/4/5 rtx gpus (with 32gb or less vram) - and you probably want moe versions like Qwen 35b for this to run at speed somewhat comparable to Claude. It’s still not there to be honest but getting there. Personally I mess around with llama.cpp on m5 max with 128gb - it’s a decent setup to try various medium sized things, and runs llms surprisingly well without quantization, at least the moe models.

      8 replies →