Comment by 2ndorderthought
19 hours ago
Well, about four weeks ago I was mostly running small models. Some of my favorites were deepseek r1 8b and qwen 3.5 9b. Those are more or less good for super fast boilerplate responses (what I cared about most).
Now I'm still trying out all the models that dropped this month. I'm running qwen 3.6 35 a3b on a 16GB RTX 4060 Ti.
I wish I'd sprung for a 24GB card, but I never thought the price difference would matter. It seems like it does, and I bet there will be more models at this size in the future, because what it can do at this size is crazy.
It's not as good as Opus if you're doing completely hands-off programming, but it's fine for me. I mostly use it for autocomplete or templating a class. Other people are using it for agentic workflows with success.
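If you want to try something similar, here's roughly the shape of my autocomplete glue code. This is a minimal sketch, not my exact script: it assumes an OpenAI-compatible local server (llama.cpp's llama-server and Ollama both expose one), and the port and model name are placeholders you'd swap for your own.

```python
# Minimal sketch: code completion against a local OpenAI-compatible server.
# Assumptions: llama-server or Ollama running locally; port and model name
# below are placeholders, not anything your server will recognize as-is.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server default; Ollama serves :11434/v1
    api_key="not-needed",                 # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen-local",  # placeholder; use whatever name your server reports
    messages=[
        {"role": "system", "content": "Complete the code. Reply with code only."},
        {"role": "user", "content": "def fib(n: int) -> int:\n    "},
    ],
    max_tokens=128,
    temperature=0.2,  # low temperature keeps completions close to deterministic
)
print(resp.choices[0].message.content)
```

Editor plugins that let you point at a custom OpenAI-compatible endpoint can talk to the same server directly, which is how I get inline autocomplete out of it.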
Check out /r/localllama for more experiences. My setup is not the best, but it's working for me and saving me money.
> My setup is not the best, but it's working for me and saving me money.
I've got a local setup too, but unless you consider the hardware free, there's really no way to save money. The class of model you can run on <$5k of hardware is dirt cheap in the cloud: generating tokens 24/7 non-stop costs a few dollars a day at most, possibly less than the electricity to do it at home.
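To put rough numbers behind that, here's the back-of-envelope math. Every constant is an assumption (API prices, throughput, wattage, and electricity rates all vary a lot), so plug in your own:

```python
# Back-of-envelope: cloud token cost vs. home electricity for 24/7 generation.
# All constants below are assumed example values, not quoted prices.
CLOUD_PRICE_PER_M_TOKENS = 0.30  # USD per million output tokens, small-model tier
TOKENS_PER_SEC = 50              # plausible sustained rate for a ~30B-class model
GPU_WATTS = 200                  # card under load
ELECTRICITY_PER_KWH = 0.15       # USD; roughly double this in parts of Europe

seconds_per_day = 24 * 3600
tokens_per_day = TOKENS_PER_SEC * seconds_per_day             # ~4.3M tokens
cloud_cost = tokens_per_day / 1e6 * CLOUD_PRICE_PER_M_TOKENS  # ~$1.30/day
home_power = GPU_WATTS / 1000 * 24 * ELECTRICITY_PER_KWH      # ~$0.72/day

print(f"tokens/day:  {tokens_per_day:,.0f}")
print(f"cloud:       ${cloud_cost:.2f}/day")
print(f"home power:  ${home_power:.2f}/day")
```

Either way you're in the low single digits of dollars per day, which is why the hardware purchase dominates the comparison, not the running costs.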
There's truth to that, but I already had the card for other purposes, and I don't have to egress or ingress anything. I love having everything local, and I like that I can sell the card later. Funny thing: my GPU has actually gone up in price, so I might even have made money.