Comment by fy20
6 days ago
I guess you are doing offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with a RTX 3060, sounds like it could work well as a local inference server.
> I guess you are doing offloading to system RAM? What tokens per second do you get?
I'm getting about 15-20 tok/s with a 128k context window using the Q3_K_S version.
For running the server:
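Something like this (a minimal llama.cpp llama-server sketch; the GGUF filename and the -ngl layer count are placeholders you'd tune for your own VRAM):

```bash
# Partial GPU offload with llama.cpp: the -ngl layers go to the GPU,
# the rest of the model stays in system RAM.
# Model path and -ngl value are placeholders, not an exact setup.
llama-server \
  -m ./model-Q3_K_S.gguf \
  -c 131072 \
  -ngl 24 \
  --host 127.0.0.1 \
  --port 8080
```

It exposes an OpenAI-compatible API, so you can point any client at http://127.0.0.1:8080/v1.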
In the article, they claim up to 25 tok/s for the LARGEST model with a 24GB VRAM card. You need a lot of system RAM, obviously.