← Back to context

Comment by unleaded

8 hours ago

Qwen3.6-35B-A3B-UD-Q4_K_M runs at about 11 tokens/second on my poor old 1060. Absolutely nuts how far we've come

I tried running any model on my 1070 and it instantly crashes my old tower, probably time to get off windows and run linux on it.

  • Understated how much of a boon for Linux that AI development has been.

    There isn’t any benefit to running a windows machine.

    • Au contraire, I run models on WSL and my desktop reliably wakes up from sleep. Best of both worlds.

  • Sounds like a hardware issue, though NVIDIA driver issues can't be ruled out, they're much rarer these days

Mind sharing your llama.cpp settings for that?

  •   .\llama-server.exe -m ..\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 999 --n-cpu-moe 41 -c 262144 --port 8081 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3 --no-mmap --mlock --host 0.0.0.0 -t 8 -tb 8 -np 1
    

    Using this llama.cpp fork https://github.com/TheTom/llama-cpp-turboquant and mostly copying from this video https://www.youtube.com/watch?v=8F_5pdcD3HY

    Haven't had much time to test it other than asking a few questions & changing some HTML in cline so it might be thick as a brick for all I know, but still worth trying

    • I just tested it with some risc-v code and it wrote down a "mov" instruction several times.. yeah something needs tuning maybe