paoliniluis 6 days ago: just finding the perfect spot between accuracy of the answers / available VRAM / tokens per second
tasuki 6 days ago: Ok, say I have 14GB VRAM. What is the tradeoff between using 9B with 8-bit params vs 27B with 3-bit params?
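For scale, the tradeoff in the question can be sized with back-of-envelope arithmetic (a sketch only: real quants add per-block metadata and mixed-precision overhead, and the KV cache sits on top of these numbers):

```python
# Back-of-envelope weight footprint: size ~= params * bits_per_weight / 8 bytes.
# Ignores KV cache, embeddings, and quantization metadata, all of which
# consume additional VRAM -- hence the need for headroom.

def weight_gb(params_billion: float, bits: float) -> float:
    """Approximate weight size in GiB for a model with the given
    parameter count (in billions) at the given bits per weight."""
    return params_billion * 1e9 * bits / 8 / 2**30

print(f"9B  @ 8-bit: {weight_gb(9, 8):.1f} GiB")   # ~8.4 GiB
print(f"27B @ 3-bit: {weight_gb(27, 3):.1f} GiB")  # ~9.4 GiB
# Both fit in 14GB VRAM, but the 27B quant leaves less headroom for cache.
```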
causal 6 days ago: You COULD even do Qwen3.5-35B-A3B-GGUF.
The UD-IQ3_XXS quant is only 13.1GB, and might outperform both in intelligence and certainly in speed (only 3B params activated): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
To accommodate the KV cache you will need to offload a few feed-forward layers to the CPU. It will still be quite fast.
Edit: Actually, 27B does a little better than 35B on most benchmarks, though 35B will still be much faster.
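A hedged sketch of what that feed-forward offload might look like with llama.cpp's `llama-server` (the `-ot`/`--override-tensor` pattern, the choice of layers, the context size, and the model filename are all illustrative assumptions, not from the thread):

```shell
# Illustrative only: request all layers on GPU (-ngl 99), then pin the
# feed-forward tensors of the first 8 blocks back to CPU with a tensor
# override, freeing VRAM for the KV cache at the chosen context size.
llama-server \
  -m Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -ot 'blk\.[0-7]\.ffn.*=CPU' \
  -c 16384
```

Because only ~3B params are active per token in an A3B model, the CPU-resident feed-forward tensors are touched sparsely, which is why the speed hit is modest.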
causal 6 days ago: 3-bit 27B will almost certainly be better. 4 bits is usually the limit below which you start to see steeper drop-offs, but you also get diminishing returns above 6 bits. So I'd still rather pack in more params at 3 bits.
9B will be faster, however.
andai 3 days ago: See also: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks