Comment by segmondy
1 day ago
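Use llama.cpp, you will be surprised how fast a model like qwen3.5-35b-a3b will run. The a3b means only about 3B parameters are active per token, so while inferring, only a small slice of the 35B weights is actually computed for each token and you get very good performance even if the whole model doesn't fit in VRAM. The full set of weights still has to live somewhere, though, since different experts get picked for different tokens. For your system, you should use the -cmoe option, which keeps the MoE expert weights in CPU memory so the GPU only has to hold the rest of the model.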
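To put rough numbers on the "35b vs a3b" point, here is a back-of-the-envelope sketch. The parameter counts are just read off the model name, and the 4.5 bits per weight is an assumption standing in for a typical 4-bit quant; real GGUF files will differ, and this ignores the KV cache and activations entirely.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size in GiB of `params` weights stored at `bits_per_weight`."""
    return params * bits_per_weight / 8 / GIB


TOTAL_PARAMS = 35e9     # the "35b" in the model name
ACTIVE_PARAMS = 3e9     # the "a3b" in the model name (active per token)
BITS_PER_WEIGHT = 4.5   # assumption: roughly a 4-bit quantization

total = weight_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)
active = weight_gib(ACTIVE_PARAMS, BITS_PER_WEIGHT)

print(f"all weights (must be stored somewhere): {total:.1f} GiB")
print(f"weights touched per token (approx):     {active:.1f} GiB")
```

The gap between the two numbers is why keeping the experts in system RAM is workable: each token only reads a small fraction of them.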