Comment by barbegal
4 hours ago
Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of the memory usage, but with very large models I can't see how that holds true.
For long context, yes, this is at least plausible: KV cache size grows linearly with context length, and the latest models are reaching context lengths of 1M tokens or perhaps more.
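A back-of-the-envelope sketch in Python, assuming an illustrative Llama-style ~7B model with grouped-query attention (32 layers, 8 KV heads, head dim 128, fp16 everywhere); the config numbers are assumptions, not any specific model's, and the per-token KV footprint is 2 × layers × kv_heads × head_dim × bytes, the 2 covering both K and V:

```python
# Rough KV cache size vs. model weights; all config numbers are illustrative.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: both K and V are stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

n_layers, n_kv_heads, head_dim = 32, 8, 128  # assumed GQA config
weights_bytes = 7e9 * 2                      # ~7B params in fp16, ~14 GB

for seq_len in (8_192, 128_000, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)
    print(f"{seq_len:>9} tokens: KV cache ~ {kv / 1e9:6.1f} GB "
          f"(weights ~ {weights_bytes / 1e9:.0f} GB)")
```

Under these assumptions the KV cache overtakes the ~14 GB of weights at roughly 100k tokens, so at 1M tokens (~131 GB) it dominates memory usage; models without GQA, or with more layers, cross over even sooner.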