Comment by vlovich123
14 days ago
Quantization by definition lowers memory requirements: instead of storing weights in f16, you store them in q8, q6, q4, or q2, which makes the weights smaller by roughly 2x, ~2.7x, 4x, or 8x respectively.
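A minimal sketch of that arithmetic, assuming a hypothetical 7B-parameter model and taking the format labels as nominal bits per weight (real quant formats such as llama.cpp's carry per-block scale metadata, so effective bits are slightly higher):

```python
# Rough weight-memory arithmetic for an assumed 7B-parameter model.
# Bits-per-weight values are nominal; actual formats add small overhead.
PARAMS = 7e9  # illustrative model size, not from the original comment

bits_per_weight = {"f16": 16, "q8": 8, "q6": 6, "q4": 4, "q2": 2}

for fmt, bits in bits_per_weight.items():
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{fmt}: {gb:5.2f} GB  ({16 / bits:.1f}x smaller than f16)")
```

For 7B parameters this prints 14 GB at f16 down to 1.75 GB at q2, reproducing the 2x/~2.7x/4x/8x factors above.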
That doesn’t necessarily translate into the full end-to-end memory reduction, because intermediate compute tensors and the KV cache take space too, but those can also be quantized.
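To see why the KV cache matters, here is a back-of-the-envelope sizing sketch, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128); all values are illustrative assumptions, not measurements:

```python
# Rough KV-cache sizing under assumed Llama-2-7B-like dimensions.
def kv_cache_gb(seq_len: int, bytes_per_elem: float,
                layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128) -> float:
    # Two tensors (K and V) per layer, one vector per head per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

for label, nbytes in [("f16", 2), ("q8", 1), ("q4", 0.5)]:
    print(f"{label} KV cache @ 4096 tokens: "
          f"{kv_cache_gb(4096, nbytes):.2f} GB")
```

Under these assumptions an f16 cache at a 4096-token context is about 2.15 GB, so quantizing the cache to q8 or q4 recovers a meaningful fraction of the memory that weight quantization alone leaves on the table.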