Comment by Twirrim
6 days ago
I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050; it's pretty responsive and does a good job of the coding tasks I've thrown at it. I need to grab the freshly updated models; the older one occasionally got stuck in a loop with tool use, which they say they've fixed.
I guess you're offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with an RTX 3060; sounds like it could work well as a local inference server.
I'm getting about 15-20 tok/s with a 128k context window using the Q3_K_S version.
For running the server, something along these lines works (the model filename and -ngl layer count below are illustrative; adjust them for your card):
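    # llama.cpp's llama-server with a 128k context (-c 131072).
    # The gguf filename is whichever quant you downloaded (Q3_K_S here);
    # -ngl sets how many layers go to VRAM, the rest offload to system RAM.
    llama-server \
        -m qwen-35b-a3b-Q3_K_S.gguf \
        -c 131072 \
        -ngl 24 \
        --port 8080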
In the article they claim up to 25 t/s for the LARGEST model with a 24 GB VRAM card. You need a lot of system RAM, obviously.
Can you give an example of some coding tasks? I had no idea local was that good.
Changed into a directory recently, fired up the qwen code CLI, and gave it two prompts: "so what's this then?", to which it gave a good summary of both the stack and the product, and then "think you can find something to do in the TODO?". While I was busy in Claude Code on another project, it neatly finished three HTML & CSS tasks that I had been procrastinating on for weeks.
This was a qwen3-coder-next 35B model on an M4 Max with 64 GB; ollama reports the model as 51 GB. I haven't yet tried the variants from TFA.
3.5 seems to be better at coding than 3-coder-next; I'd check it out.
I personally have used Qwen2.5-coder:14B for "live, talking rubber duck" sorts of things.
"I am learning Elixir, can you explain this code to me?" (And then I can also ask follow-up questions.)
"Here is a bunch of logs. Given that the symptom is that the system fails to process a message, what log messages jump out as suspicious for dropping a message?"
"Here is the code I want to test. <code> Here are the existing tests. <test code> What is one additional test you would add?"
"I am learning Elixir. Here is some code that fails to compile, here is the error message, can you walk me through what I did wrong?"
I haven't gotten much value out of "review this code", but maybe I'll have to try prompting for "persona: brief rude senior" as mentioned elsewhere.
3.5 is doing a good job of reviewing code, even without prompting it to be brief and/or rude.
I've been using opencode pointed at the local model served by llama.cpp (llama-server exposes an OpenAI-compatible endpoint, so opencode can talk to it directly).
The last thing I had it build is a Rust-based app that pulls data from a set of APIs every 2 minutes, processes it, and stores it in a local database, with a half-hourly task that does further analysis. It has done a decent job.
It's definitely not as fast or as good as the large online models, but it's fast enough and good enough, and it runs on hardware I already had spare.
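The overall shape is roughly this (a sketch only, assuming tokio, reqwest, and rusqlite as deps; the endpoints and schema are placeholders, not the real app):

    // Sketch: two timers in one task, shared SQLite handle.
    // Cargo deps assumed: tokio (features = ["full"]), reqwest, rusqlite.
    use std::time::Duration;
    use tokio::time::interval;

    // Pull one API and return its raw body (URLs below are placeholders).
    async fn fetch(url: &str) -> Result<String, reqwest::Error> {
        reqwest::get(url).await?.text().await
    }

    #[tokio::main]
    async fn main() {
        let conn = rusqlite::Connection::open("data.db").expect("open db");
        conn.execute_batch(
            "CREATE TABLE IF NOT EXISTS samples (ts INTEGER, source TEXT, body TEXT);",
        )
        .expect("create table");

        let mut poll = interval(Duration::from_secs(120)); // every 2 minutes
        let mut analyze = interval(Duration::from_secs(1800)); // every 30 minutes

        loop {
            tokio::select! {
                _ = poll.tick() => {
                    for url in ["https://api.example.com/a", "https://api.example.com/b"] {
                        match fetch(url).await {
                            Ok(body) => {
                                let ts = std::time::SystemTime::now()
                                    .duration_since(std::time::UNIX_EPOCH)
                                    .unwrap()
                                    .as_secs() as i64;
                                conn.execute(
                                    "INSERT INTO samples (ts, source, body) VALUES (?1, ?2, ?3)",
                                    rusqlite::params![ts, url, body],
                                ).expect("insert");
                            }
                            Err(e) => eprintln!("fetch {url} failed: {e}"),
                        }
                    }
                }
                _ = analyze.tick() => {
                    // Stand-in for the half-hourly analysis pass.
                    let n: i64 = conn
                        .query_row("SELECT COUNT(*) FROM samples", [], |row| row.get(0))
                        .expect("count");
                    println!("analysis tick: {n} rows stored so far");
                }
            }
        }
    }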
Which models would those be?
unsloth's quantized ones. On the site this article links to, they mention that a couple of days ago they released freshly updated quantized versions of Qwen3.5-35B, 27B, 122B and 397B, with various improvements.
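If you want to grab one for llama.cpp, something like this works (the repo name is a guess at unsloth's usual naming; check their Hugging Face page for the exact one):

    # Download only the Q3_K_S files; the repo name here is hypothetical.
    huggingface-cli download unsloth/Qwen3.5-35B-GGUF \
        --include "*Q3_K_S*" \
        --local-dir ./models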