Comment by barbegal

4 hours ago

Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of memory usage, but with very large models I can't see how that holds true.

For long context, yes, this is at least plausible. And the latest models are reaching context lengths of 1M tokens or perhaps more.
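As a rough sanity check, the claim can be tested with a back-of-envelope calculation. The sketch below assumes a hypothetical 70B-parameter model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16 weights and cache); the exact figures depend on the real architecture, but the shape of the comparison holds:

```python
# Back-of-envelope: KV cache size vs. model weight size.
# Assumed (hypothetical) 70B GQA config; real models will differ.
n_layers = 80
n_kv_heads = 8        # GQA: far fewer KV heads than query heads
head_dim = 128
bytes_per_elem = 2    # fp16 / bf16

def kv_cache_bytes(n_tokens: int) -> int:
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

weights_bytes = 70e9 * bytes_per_elem  # ~140 GB for 70B params in fp16

for ctx in (8_192, 128_000, 1_000_000):
    kv = kv_cache_bytes(ctx)
    print(f"{ctx:>9} tokens: KV cache {kv / 1e9:6.1f} GB "
          f"({kv / weights_bytes:.0%} of weights)")
```

Under these assumptions the cache is only a few percent of the weights at 8K context, but at 1M tokens it exceeds the weights themselves, which is why the "KV cache dominates memory" framing only applies at long context (or with many concurrent sequences, since the cache scales per request while weights are shared).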