Comment by 0-_-0

9 days ago

> The cache gets read at every token generated, not at every turn of the conversation.

Depends on which cache you mean. The KV Cache gets read on every token generated, but the prompt cache (which is what incurs the cache-read cost) is read when a conversation starts.
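
To make the distinction concrete, here is a minimal decode-loop sketch (the `model.step`, `prompt_cache.lookup_or_prefill`, and `eos_token_id` names are hypothetical, not any real provider's API): the prompt cache is consulted once per request, at prefill time, while the KV Cache is read on every generated token.

```python
def generate(prompt_tokens, model, prompt_cache, max_new_tokens=256):
    # One prompt-cache lookup per request: reuse the KV state of the
    # longest cached prefix and compute only the remainder. This prefill
    # step is what gets billed as a "cache read".
    kv, logits = prompt_cache.lookup_or_prefill(prompt_tokens, model)

    output = []
    for _ in range(max_new_tokens):
        token = int(logits.argmax())  # greedy decoding, for simplicity
        output.append(token)
        if token == model.eos_token_id:
            break
        # Every decode step attends over the entire KV Cache built so
        # far: this is the "read at every token generated" part.
        logits, kv = model.step(token, kv)
    return output
```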

  • What's in the prompt cache?

    • The prompt cache stores KV Cache states keyed on prefixes of previous prompts and conversations. For a particular coding agent, caching may be more involved than this (cache handles and so on); I'm describing the general case. The point is to avoid repeatedly paying the quadratic cost of computing attention over the same prompt, and LLM providers typically price a read from this cache far below recomputation (see the sketch below).

      Since the prompt cache is, by necessity (this is how LLMs work), keyed on a prefix of the prompt, a service making repeated API calls can save a lot by ordering queries so that rarely-varying content comes first and frequently-varying content comes last. For example, if you include the current date and time as the first item in your call, you force a full recomputation on every request.
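
      A toy sketch of both points (illustrative only; real providers match on token-block boundaries and can only resume from KV states they actually checkpointed, and `SYSTEM_RULES` stands in for a long, stable instruction block):

      ```python
      from datetime import datetime, timezone

      class PrefixCache:
          """Toy prompt cache mapping previously seen prompts to KV states."""

          def __init__(self):
              self._store: dict[str, object] = {}

          def put(self, prompt: str, kv_state: object) -> None:
              self._store[prompt] = kv_state

          def longest_cached_prefix(self, prompt: str) -> int:
              # How many leading characters `prompt` shares with any cached
              # prompt. (Real systems match whole token blocks, not chars.)
              best = 0
              for cached in self._store:
                  n = 0
                  limit = min(len(cached), len(prompt))
                  while n < limit and cached[n] == prompt[n]:
                      n += 1
                  best = max(best, n)
              return best

      SYSTEM_RULES = "You are a support assistant. <several KB of stable rules>"

      def build_prompt(question: str, date_first: bool) -> str:
          now = datetime.now(timezone.utc).isoformat()
          if date_first:
              # Volatile data first: prompts diverge as soon as the
              # timestamps differ, so the long rules block is never reused.
              return f"Current time: {now}\n{SYSTEM_RULES}\n{question}"
          # Stable data first: the whole rules block is a shared prefix,
          # and only the short volatile tail needs recomputation.
          return f"{SYSTEM_RULES}\nCurrent time: {now}\n{question}"

      for date_first in (True, False):
          cache = PrefixCache()
          cache.put(build_prompt("Where is my order?", date_first), object())
          reused = cache.longest_cached_prefix(build_prompt("Cancel it.", date_first))
          print(f"date_first={date_first}: {reused} leading chars reusable")
      ```

      The same logic explains why putting a session ID or timestamp at the top of a shared system prompt quietly disables caching for everything below it.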

    • Way too much. This has got to be the most expensive and least sensible way to make software ever devised.