Comment by EnPissant

1 day ago

For contrast, I get the following for an RTX 5090 and Qwen3 Coder 30B quantized to ~4 bits:

- Prompt processing 65k tokens: 4818 tokens/s

- Token generation 8k tokens: 221 tokens/s

If I offload just the experts to run on the CPU, I get:

- Prompt processing 65k tokens: 3039 tokens/s

- Token generation 8k tokens: 42.85 tokens/s

As you can see, token generation is over 5x slower (221 vs. 42.85 tokens/s). The CPU-offload configuration uses only ~5.5 GB of VRAM, so token generation could be sped up a small amount by moving a few of the experts onto the GPU.
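
In case anyone wants to reproduce this, the commands look roughly like the sketch below. This assumes llama.cpp's llama-server; the model filename is illustrative, and the tensor-override regexes depend on the GGUF's tensor names (Qwen3's MoE expert weights follow the `blk.N.ffn_*_exps` pattern, and the 30B model has 48 layers):

```bash
# Baseline: all layers fully offloaded to the GPU.
llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf -ngl 99 -c 65536

# Experts on the CPU: layers are still offloaded, but the MoE expert
# tensors (blk.N.ffn_{up,down,gate}_exps) are pinned to system RAM.
llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf -ngl 99 -c 65536 \
  -ot 'blk\..*\.ffn_.*_exps.*=CPU'

# Hybrid: spend the spare VRAM by keeping the first 8 layers' experts
# on the GPU and pinning only layers 8-47 to the CPU.
llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf -ngl 99 -c 65536 \
  -ot 'blk\.([8-9]|[1-4][0-9])\.ffn_.*_exps.*=CPU'
```

If I remember right, newer llama.cpp builds also expose an `--n-cpu-moe N` shorthand that pins the first N layers' experts to the CPU without writing the regex by hand.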