Comment by tempest_
15 hours ago
I use CC, and I understand what caching means.
I have no idea how that works with an LLM implementation, nor do I actually know what they are caching in this context.
Reply, 15 hours ago:
> I use CC, and I understand what caching means.
> I have no idea how that works with an LLM implementation, nor do I actually know what they are caching in this context.
They are caching internal LLM state, which runs to tens of gigabytes per session. It's called a KV cache, because what gets cached are the key (K) and value (V) matrices, and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.
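To make the idea concrete, here is a minimal sketch of what a KV cache does during autoregressive decoding. The dimensions are tiny, the weights are random, and the `attend` helper is a toy single-head, single-layer stand-in; it is not any real model's implementation, just the caching pattern.

```python
import numpy as np

# Toy single-head attention with a KV cache (illustrative only).
d_model = 8
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def attend(q, K, V):
    # One query token attending over all cached keys/values.
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# The KV cache grows by one row per token, so the K and V
# projections of earlier tokens are never recomputed.
K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))

outputs = []
for step in range(5):
    x = rng.standard_normal(d_model)  # stand-in for the new token's embedding
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    outputs.append(attend(x @ W_q, K_cache, V_cache))

print(K_cache.shape)  # one cached K row per token processed so far
```

In a real model this cache exists per layer and per attention head, which is why it reaches tens of gigabytes for long sessions: its size scales with layers × tokens × heads × head dimension × bytes per value.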
CC can explain it clearly, which is how I learned how the inference stack works.