Comment by omneity
10 months ago
I'm using Qwen3-30B-A3B locally and it's very impressive. It feels like the GPT-4 killer we've been waiting two years for. I'm getting 70 tok/s on an M3 Max, which pushes it into the "very usable" quadrant.
Even more impressive is the 0.6B model, which makes sub-1B models actually useful for non-trivial tasks.
Overall, very impressed. I'm evaluating how it can integrate with my current setup and will probably write up my findings somewhere.
Personally, I'm getting 15 tok/s on both an RTX 3060 and my MacBook Air M4 (with 32GB, though 24 should suffice), using the default config from LM Studio.
I find that even more impressive, considering the 3060 is the most-used GPU on Steam and that the M4 Air and future SoCs are, or will be, commonplace too.
(Q4_K_M, 18 GB file size)
One of the most interesting things about that model is its excellent score on the RAG confabulations (hallucination) leaderboard. It’s the 3rd best model overall, beating all OpenAI models, for example. I wonder what Alibaba did to achieve that.
https://github.com/lechmazur/confabulations
What tasks have you found the 0.6B model useful for? The hallucination apparent in its thinking process was a big red flag for me.
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
You can use the 0.6B as a draft model for speculative decoding with the larger models. It speeds up the 32B, but it slows down 30B-A3B dramatically.
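For anyone who wants to try it, this is roughly what that looks like with llama.cpp's llama-server, launched from Python. The draft-model flags (-md / --model-draft, --draft-max) are from recent llama.cpp builds and may be named differently in yours, and the GGUF paths are placeholders.

    # Sketch: dense 32B target + 0.6B draft for speculative decoding.
    # Check `llama-server --help` in your build for the exact flag names.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m",  "Qwen3-32B-Q4_K_M.gguf",    # target model (placeholder path)
        "-md", "Qwen3-0.6B-Q8_0.gguf",     # draft model (placeholder path)
        "--draft-max", "8",                # max tokens drafted per verification step
        "--port", "8080",
    ])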
It's okay for extracting simple things like addresses, or for formatting text with some input data, like a more advanced form of mail merge.
I haven't evaluated these tasks rigorously, so YMMV. I'm exploring other possibilities as well. I suspect it might be decent at autocomplete, and it's small enough that one could consider fine-tuning it on a codebase.
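To give a concrete picture of the kind of extraction I mean, here's a minimal sketch with the 0.6B behind an OpenAI-compatible local server (LM Studio's server defaults to port 1234; the model name is whatever your server exposes).

    # Sketch: address extraction with a local Qwen3-0.6B. Base URL, port and
    # model name depend on your local server; these are illustrative defaults.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    text = "Ship it to Jane Doe, 42 Rue de la Paix, 75002 Paris, by Friday."
    resp = client.chat.completions.create(
        model="qwen3-0.6b",
        messages=[
            {"role": "system",
             "content": "Extract the postal address from the user's text. "
                        "Reply with the address only."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    print(resp.choices[0].message.content)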
Importantly, they note that using a draft model degrades its output, and that was my experience. I was initially impressed, then started seeing problems, but after disabling my draft model it worked much better. Very cool stuff; it's fast too, as you note.
The /think and /no_think commands are very convenient.
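They're just soft switches appended to the user message, so they're easy to wire into a helper. A minimal sketch against a local OpenAI-compatible server (port and model name are whatever your setup uses):

    # Sketch: toggling Qwen3's thinking mode via the /think and /no_think
    # soft switches appended to the prompt.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    def ask(question, think=False):
        switch = " /think" if think else " /no_think"
        resp = client.chat.completions.create(
            model="qwen3-30b-a3b",
            messages=[{"role": "user", "content": question + switch}],
        )
        return resp.choices[0].message.content

    print(ask("What is 17 * 23?"))                                  # quick answer, no reasoning trace
    print(ask("Plan a 3-step refactor of a parser.", think=True))   # full thinking first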
That should not be the case. Speculative decoding trades extra compute for memory bandwidth; the target model verifies every drafted token, so its output is guaranteed to be the same with or without it. Perhaps there's a bug in the implementation you're using.
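To make that guarantee concrete, here's a toy greedy-decoding sketch; the two "models" are lookup tables standing in for real LLMs, and a real implementation verifies all drafted tokens in one batched forward pass of the target.

    # Toy greedy speculative decoding: the target model verifies every drafted
    # token and overrides the first mismatch, so the output is identical to
    # plain greedy decoding of the target alone.
    TARGET_TEXT = ["the", "cat", "sat", "on", "the", "mat"]

    def target_next(ctx):
        """Greedy next token of the big model (stand-in: read off a fixed text)."""
        return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "<eos>"

    def draft_next(ctx):
        """Greedy next token of the small draft model; wrong at one position."""
        return "a" if len(ctx) == 3 else target_next(ctx)

    def speculative_greedy(prompt, k=3):
        ctx = list(prompt)
        while ctx[-1:] != ["<eos>"]:
            draft = []
            for _ in range(k):                    # 1) draft k tokens cheaply
                draft.append(draft_next(ctx + draft))
            for t in draft:                       # 2) verify with the target
                want = target_next(ctx)
                ctx.append(want)                  # always keep the target's token
                if want != t:                     # mismatch: drop remaining drafts
                    break
            # (real implementations also keep one bonus target token when all k match)
        return ctx

    out = speculative_greedy(["the"])
    ref = ["the"]
    while ref[-1] != "<eos>":                     # plain greedy decoding, no draft
        ref.append(target_next(ref))
    assert out == ref
    print(" ".join(out))                          # the cat sat on the mat <eos>

The draft only changes how many target steps get batched together, never which token is kept; with sampling, rejection sampling gives the same guarantee on the output distribution.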
What do you mean by draft model? And how would one disable it? Cheers
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
How much RAM do you have? I want to compare with my local setup (M4 Pro).
I have an MBP M1 Max 64GB and I get 40 tok/s with llama.cpp and the unsloth q4_k_m quant of the 30B-A3B model. I always use /nothink and Temperature=0.7, TopP=0.8, TopK=20, MinP=0; these are the settings recommended for Qwen3 [1] and they make a big difference (request sketch below the links). With the default settings from llama-server it will always run into an endless loop.
The quality of the output is decent; just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-Coder 32B Q4 runs at around 12-14 tok/s on this M1, which is too slow for me. I usually used Qwen2.5-Coder 14B at around 30 tok/s for simple tasks. Qwen3 30B is imho better and faster.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
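If it helps, here's roughly what those settings look like through llama-server's OpenAI-compatible endpoint. top_k and min_p aren't part of the standard OpenAI schema, so they go in extra_body (llama-server passes them through; other servers may ignore them). Port and model name depend on your setup.

    # Sketch: the recommended Qwen3 non-thinking sampling settings from [1],
    # sent to a local llama-server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",
        messages=[{"role": "user",
                   "content": "Translate to German: « Le chat dort au soleil. » /no_think"}],
        temperature=0.7,
        top_p=0.8,
        extra_body={"top_k": 20, "min_p": 0.0},   # non-standard samplers go via extra_body
    )
    print(resp.choices[0].message.content)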
128GB but it's not using much.
I'm running Q4 and it's using 17.94 GB of VRAM with a 4K context window, and 20 GB at 32K tokens.
I am not a Mac person, but I'm debating buying one for the unified RAM now that the prices seem to be inching down. Is it painful to set up? The responses I get range from "it takes zero effort" to "it was a major hassle to set everything up".
I'm curious: have you tried function calling or MCPs with it?
It fits entirely in my 7900 XTX's memory. But tbh I've been disappointed with its programming ability so far.
It's using 20GB of memory according to ollama.
I'm getting 56 tok/s with MLX and LM Studio. How are you getting 76?
I went from the bottom up: started with 4B, then 8B, then 30B, and 30B was the only one that actually started to "use tools". The other models said they would use them but never did, or didn't notice all the tools. I think anything above 30B would actually be able to go full GPT on a task. 30B does it, but a bit... meh.
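For reference, this is roughly how I'd poke at tool use against a local 30B-A3B through an OpenAI-compatible server. The get_weather tool and its schema are made up for illustration, and whether the model actually emits the call is exactly the behaviour being discussed.

    # Sketch: one tool-calling round trip. Tool name/schema are hypothetical;
    # port and model name depend on your local server.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",                # hypothetical tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",
        messages=[{"role": "user", "content": "What's the weather in Lisbon? /no_think"}],
        tools=tools,
    )

    msg = resp.choices[0].message
    if msg.tool_calls:                            # smaller models often skip this step
        call = msg.tool_calls[0]
        print(call.function.name, json.loads(call.function.arguments))
    else:
        print(msg.content)                        # answered without calling the tool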