← Back to context

Comment by unleaded

11 hours ago

Qwen3.6-35B-A3B-UD-Q4_K_M runs at about 11 tokens/second on my poor old 1060. Absolutely nuts how far we've come

I tried running any model on my 1070 and it instantly crashes my old tower, probably time to get off windows and run linux on it.

Mind sharing your llama.cpp settings for that?

  •   .\llama-server.exe -m ..\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 999 --n-cpu-moe 41 -c 262144 --port 8081 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3 --no-mmap --mlock --host 0.0.0.0 -t 8 -tb 8 -np 1
    

    Using this llama.cpp fork https://github.com/TheTom/llama-cpp-turboquant and mostly copying from this video https://www.youtube.com/watch?v=8F_5pdcD3HY

    Haven't had much time to test it other than asking a few questions & changing some HTML in cline so it might be thick as a brick for all I know, but still worth trying

    • I just tested it with some risc-v code and it wrote down a "mov" instruction several times.. yeah something needs tuning maybe