Comment by akulbe

7 hours ago

Any chance you'd be willing to talk further about your setup? I have 2 x 3090s in a local machine, and I'm still left with questions about how best to use stuff locally.


sheeshkebab 5 hours ago

You can only run heavily quantized models on RTX 30/40/50-series GPUs (32 GB of VRAM or less), and you probably want MoE versions like Qwen 35B to get speed somewhat comparable to Claude. It's still not there, to be honest, but it's getting there. Personally I mess around with llama.cpp on an M5 Max with 128 GB. It's a decent setup for trying various medium-sized things, and it runs LLMs surprisingly well without quantization, at least the MoE models.

akulbe 1 hour ago

How is that machine for local inference? It's a serious consideration for me, but getting to hear more from folks that already have it would be helpful.

SwellJoe 5 hours ago

Two 3090s is 48GB, so it's possible to run the 6-bit quantization comfortably, which is fine. It doesn't start to get notably dumber until lower than that. It won't be as fast as a hosted model, but dual 3090s will be comfortably fast for interactive use with the MoE version and not terrible to use with the dense model. I run the dense model at 8 bits on my dual Radeon V620 desktop machine, which I think would be slower than two 3090s, or at least not notably faster.
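
For a rough sense of the arithmetic behind "48GB fits 6-bit comfortably", here is a back-of-the-envelope sketch. It assumes a 35B-parameter model (taken from the "Qwen 35b" mentioned upthread, not from this comment) and a made-up `overhead_gib` allowance for KV cache and runtime buffers. Real quantization formats also store scales, so their effective bits per weight run somewhat higher than the nominal figure; treat the numbers as estimates only.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 6.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM.

    overhead_gib is a hypothetical allowance for KV cache and buffers;
    the real figure depends on context length and the runtime used.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    PARAMS_B = 35   # assumption: a 35B-parameter model
    VRAM_GIB = 48   # two 24 GB cards, treated as one pool
    for bits in (4, 6, 8, 16):
        size = weight_gib(PARAMS_B, bits)
        verdict = "fits" if fits(PARAMS_B, bits, VRAM_GIB) else "does not fit"
        print(f"{bits:>2}-bit: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions the 6-bit and 8-bit weights leave headroom on 48 GB, while unquantized 16-bit weights do not fit.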
- hedgehog 5 hours ago
  
  Have you compared it against 4-bit and seen a noticeable difference for coding tasks?
  