Comment by pclmulqdq

2 months ago

I assume KV caching makes this a non-issue, but I'm also curious.

2 comments

pclmulqdq

If you're just chatting with it, starting with "Hi", that's correct. The conversation stays in the KV cache as it gradually grows.

But if you're posting code, writing drafts, or even pasting small snippets of articles, etc. in there, it becomes a huge problem.
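To put rough numbers on that, here is a minimal sketch. It assumes everything from earlier turns is still sitting in the KV cache, so each turn only pays prefill for its own new tokens. The function name and the token counts are made up for illustration and are not measurements from any real model.

```python
def uncached_tokens_per_turn(turn_lengths):
    """Return a (cached, uncached) token-count pair for each turn.

    Assumption: the KV cache keeps every earlier turn, so only the
    tokens that arrive in the current turn need a fresh prefill.
    """
    cached = 0
    result = []
    for new_tokens in turn_lengths:
        result.append((cached, new_tokens))
        cached += new_tokens  # this turn is cached for the next one
    return result


# Hypothetical token counts per turn.
small_talk = uncached_tokens_per_turn([3, 12, 15, 20])
with_paste = uncached_tokens_per_turn([3, 8000, 15])  # turn 2 pastes a file

print(small_talk)
print(with_paste)
```

In the small-talk case every turn adds a handful of uncached tokens. In the paste case the second turn has to prefill all 8000 new tokens, no matter how much of the earlier conversation is cached.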

pclmulqdq 2 months ago

Usually, when people think about the prompt tokens for a chat model, the initial system prompt accounts for the vast majority of the tokens, and it's the same across many usage modes. You might have a slightly different system prompt for code than you have for English or for chatting, but that is only three prompts, which you can keep permanently in some sort of persistent KV cache. After that, only your specific request in that mode is uncached.
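A rough sketch of that idea, under stated assumptions: `PersistentPrefixCache`, `pin`, and `split` are hypothetical names, the whitespace tokenizer is a stand-in for a real one, and only token counts are tracked, not actual key/value tensors. The point is just that the longest pinned prefix matching the incoming prompt is what you get for free.

```python
def tokenize(text):
    # Stand-in tokenizer (assumption): real models use subword tokenizers.
    return text.split()


def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PersistentPrefixCache:
    """Toy model of a cache holding a few pinned system prompts."""

    def __init__(self):
        self._prefixes = {}  # mode name -> token list of its system prompt

    def pin(self, mode, system_prompt):
        self._prefixes[mode] = tokenize(system_prompt)

    def split(self, prompt):
        """Return (cached, uncached) token counts for a full prompt."""
        tokens = tokenize(prompt)
        cached = max(
            (common_prefix_len(p, tokens) for p in self._prefixes.values()),
            default=0,
        )
        return cached, len(tokens) - cached


cache = PersistentPrefixCache()
cache.pin("code", "You are a coding assistant .")
cache.pin("chat", "You are a friendly chat companion .")
cache.pin("english", "You are an English writing tutor .")

print(cache.split("You are a coding assistant . Fix this bug"))
print(cache.split("Hello there"))
```

With the toy prompts above, the first request reuses the six pinned "code" tokens and only the three request tokens are uncached. The second request matches no pinned prefix, so both of its tokens need prefill.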