Comment by drivebyhooting

1 day ago

I have a Lenovo workstation with 256 GB of RAM but a weak-sauce 12 GB VRAM GPU. Is there any DMA trick to improve offload performance?

Use llama.cpp; you'll be surprised how fast a model like Qwen3-30B-A3B runs. The A3B means only 3B active parameters, so during inference only those ~3B parameters need to sit on your GPU at a time, and you get great performance. For your system, use the -cmoe (--cpu-moe) option, which keeps the MoE expert weights in system RAM and puts everything else on the GPU.
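A minimal sketch of what that invocation looks like, assuming you have a local GGUF quantization of the model (the filename here is a placeholder):

```shell
# -ngl 99 offloads all layers to the GPU; --cpu-moe (-cmoe) then overrides
# that for the MoE expert tensors, keeping them in system RAM so the 12 GB
# card only has to hold the attention/dense weights plus the KV cache.
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 --cpu-moe
```

If VRAM is still tight or you have headroom to spare, llama.cpp also has --n-cpu-moe N, which keeps only the experts of the first N layers on the CPU so you can tune the split instead of moving all experts off the GPU.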