Comment by mashygpig
20 hours ago
It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.
Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally, like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. It goes even further on a model you would be able to run locally. Then think of how long it would take for API spend to match the cost of hardware + power for doing it locally...
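To put rough numbers on that last point, here's a minimal break-even sketch. Every figure in it is a made-up assumption (hardware price, wattage, electricity rate, hours of use), and `breakeven_days` is just an illustrative helper name, so plug in your own numbers.

```python
def breakeven_days(hardware_cost, api_cost_per_day, watts, hours_per_day, price_per_kwh):
    """Days until cumulative API spend would have paid for the local hardware.

    Returns None when local power alone costs at least as much per day as
    the API, i.e. the hardware never pays for itself.
    """
    # Electricity cost of running the machine under load each day.
    power_cost_per_day = watts / 1000 * hours_per_day * price_per_kwh
    # What you save each day by not paying for API tokens.
    daily_saving = api_cost_per_day - power_cost_per_day
    if daily_saving <= 0:
        return None
    return hardware_cost / daily_saving


# Hypothetical light user: $0.25/day of API use, 2 h/day on a 300 W box.
light = breakeven_days(2000, 0.25, watts=300, hours_per_day=2, price_per_kwh=0.25)

# Hypothetical heavy user: $5/day of API use, 8 h/day on the same box.
heavy = breakeven_days(2000, 5.00, watts=300, hours_per_day=8, price_per_kwh=0.25)

print(f"light user: {light:.0f} days, heavy user: {heavy:.0f} days")
```

With these assumed inputs the light user needs decades to break even and the heavy user a bit over a year, so the answer depends almost entirely on how many tokens you actually burn per day.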
Even with deepseek v4 flash I burned through $5 in credits in a day just playing around with Hermes, and qwen 3.6 35B is significantly more expensive.
I can run qwen 3.6 35B on my gaming PC at around 50 tok/s, and other than a slightly higher power bill each month, it's hardware I already owned from years ago.
I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.
How do you run 35B on a gaming PC?
I'm trying to go the same route, but I have a 5070Ti with only 16GB VRAM (I bought it for gaming) and I'm not sure how to run anything decent on it. I have 64GB RAM, if that matters.
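I run it on a 12GB 4070 with 32GB system RAM. 35B A4B means it's a mixture-of-experts model where only about 4B of the 35B parameters are active for any given token, so with the expert weights offloaded to system RAM it takes a lot less VRAM than a dense 35B model would.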
The main thing in LM Studio (or whatever software you use, assuming it's fairly up to date and exposes the toggles) is to offload MoE layers to the CPU, and use K/V cache quantization at Q8_0 or Q4_0.
Since you have more VRAM than I do, you could probably get away with a MoE offload setting of around 15-20 so some of the expert layers remain on the GPU.
Just make sure GPU offload is turned all the way up. And I use 64k context size, although with 16GB VRAM you can probably do more.
You can find the best performance spot by playing with MoE offload until you find the number that gives the highest tok/s on your hardware.
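That tuning loop is just a sweep, so here's a minimal sketch of it. Big assumption: `measure_tok_s` is a hypothetical callback you'd write yourself (load the model with a given MoE offload value, generate a fixed prompt, return tok/s); it is not an LM Studio API. `fake_measure` only exists so the example runs, and its curve is invented.

```python
from typing import Callable, Iterable, Tuple


def best_moe_offload(
    candidates: Iterable[int],
    measure_tok_s: Callable[[int], float],
    runs: int = 3,
) -> Tuple[int, float]:
    """Try each MoE offload value and return (value, mean tok/s) of the fastest."""
    best_n, best_speed = None, float("-inf")
    for n in candidates:
        # Average a few runs, since single measurements tend to be noisy.
        speed = sum(measure_tok_s(n) for _ in range(runs)) / runs
        if speed > best_speed:
            best_n, best_speed = n, speed
    if best_n is None:
        raise ValueError("no candidate offload values given")
    return best_n, best_speed


def fake_measure(n: int) -> float:
    """Stand-in benchmark with an invented peak at 18 (not real data)."""
    return 50.0 - abs(n - 18) * 1.5


best_n, best_speed = best_moe_offload(range(10, 31, 2), fake_measure)
print(f"best MoE offload: {best_n} at {best_speed:.1f} tok/s")
```

In practice you'd probably just do this by hand over a handful of values, keeping context size and K/V cache settings fixed between runs so the numbers are comparable.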
CPU MoE offloading. See e.g.
https://www.reddit.com/r/LocalLLaMA/comments/1t9eo83/running...
If you're not good at prompting yet, that $10 doesn't go very far. The local model allows me to learn what works and what doesn't without paying for tokens. Then when I know how not to waste them, I'll try a paid model.
There is one side effect of running your LLM locally: you stop thinking about the token budget. I often run `/goal` with no limits, or script an endless loop in bash to run opencode, etc. Sometimes I just brute force the task by throwing a `/goal` at it. Maybe it's not the most efficient use either, but it's nice to have the option.
Agreed, I'm waiting for the time when 48GB+ RAM is just the standard that computers come with rather than being the absolute top-tier option. It just doesn't make sense to spend extra on a local AI computer right now when the same money would last for a decade at API pricing.
Have you considered this may never happen? What if datacenters continue to swallow all capacity?
Those are all pre-rugpull prices though. Give it a year.