Comment by stavros
6 days ago
Probably because the costly operation is loading it onto the GPU; it doesn't matter whether it comes from disk or from your request.
6 days ago
The point of prompt caching is to save on prefill, which for large contexts (common in agentic workloads) is quite expensive per token. So there is a context length beyond which storing the KV cache is worth it, because loading it back in is cheaper than recomputing it. For larger SOTA models, the per-token KV cache size is also much smaller relative to the compute cost of prefill, so caching becomes worthwhile even for smaller contexts.
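To make the break-even argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the model shapes, the hardware numbers (`effective_flops`, `load_bandwidth`, `load_latency`), and the cost formulas (about 2 FLOPs per parameter per token for the dense layers, plus a causal-attention term that grows with the square of the context). It is not how any particular provider prices or implements caching.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # Hypothetical model shape, for illustration only.
    n_params: float
    n_layers: int
    d_model: int
    n_kv_heads: int
    head_dim: int
    kv_bytes_per_elem: int = 2  # assumes fp16/bf16 cache entries


@dataclass
class Hardware:
    # Hypothetical numbers, not measurements.
    effective_flops: float  # sustained FLOP/s during prefill
    load_bandwidth: float   # bytes/s from cache storage to the GPU
    load_latency: float     # fixed seconds to locate and start a fetch


def kv_bytes_per_token(m: ModelSpec) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes_per_elem


def prefill_flops(m: ModelSpec, n_tokens: int) -> float:
    # Rough estimate: dense matmuls plus causal attention over the prefix.
    dense = 2 * m.n_params * n_tokens
    attention = 2 * m.n_layers * m.d_model * n_tokens ** 2
    return dense + attention


def recompute_seconds(m: ModelSpec, hw: Hardware, n_tokens: int) -> float:
    return prefill_flops(m, n_tokens) / hw.effective_flops


def load_seconds(m: ModelSpec, hw: Hardware, n_tokens: int) -> float:
    return hw.load_latency + kv_bytes_per_token(m) * n_tokens / hw.load_bandwidth


def caching_pays_off(m: ModelSpec, hw: Hardware, n_tokens: int) -> bool:
    return load_seconds(m, hw, n_tokens) < recompute_seconds(m, hw, n_tokens)


def break_even_tokens(m: ModelSpec, hw: Hardware, max_tokens: int = 1_000_000):
    # Smallest prefix length at which loading beats recomputing, if any.
    for n in range(1, max_tokens + 1):
        if caching_pays_off(m, hw, n):
            return n
    return None


def dense_flops_per_kv_byte(m: ModelSpec) -> float:
    # Prefill compute saved per byte of cache stored (dense term only).
    return 2 * m.n_params / kv_bytes_per_token(m)


# Two made-up model shapes and one made-up hardware profile.
LARGE = ModelSpec(n_params=70e9, n_layers=80, d_model=8192, n_kv_heads=8, head_dim=128)
SMALL = ModelSpec(n_params=8e9, n_layers=32, d_model=4096, n_kv_heads=8, head_dim=128)
HW = Hardware(effective_flops=4e14, load_bandwidth=2e9, load_latency=0.05)

if __name__ == "__main__":
    for name, spec in (("large", LARGE), ("small", SMALL)):
        print(name, "break-even tokens:", break_even_tokens(spec, HW))
        print(name, "dense FLOPs per KV byte:", dense_flops_per_kv_byte(spec))
```

Under these assumptions the fixed fetch latency dominates for short prefixes, so recomputing wins there, while the per-token cost of loading is lower than the per-token cost of prefill, so loading wins once the prefix is long enough. The `dense_flops_per_kv_byte` ratio is one way to read the last claim above: in this sketch the larger model saves more compute per byte of cache stored than the smaller one.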