Comment by nardi

2 months ago

What's your prompt processing speed? That's more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.

Prompt eval time varies a lot with context, but it feels real-time for short prompts - roughly 20 tokens per second, though I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back-and-forth it is still quite fast - I use the KV cache, which I assume helps, and I also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of, say, 15,000 words, it does take a long time - I walk away, come back in about 20 minutes, and it will be complete.
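
For what it's worth, the 20-minute figure is roughly what those numbers predict. A rough back-of-envelope check (assuming ~1.3 tokens per word, which is an estimate, not something I measured):

    # Sanity check of the ~20-minute summarization wait, using the ~20 tok/s
    # prompt eval rate mentioned above. Both numbers are rough estimates.
    words = 15_000
    tokens = int(words * 1.3)          # ~19,500 prompt tokens
    prompt_eval_rate = 20              # tokens per second
    seconds = tokens / prompt_eval_rate
    print(f"{tokens} tokens / {prompt_eval_rate} tok/s = {seconds / 60:.0f} minutes")
    # -> roughly 16 minutes of prompt processing before generation even starts

So most of that wait is prompt processing rather than generation.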

If he is doing multi-turn conversations, he can reuse the KV cache from the previous turn and skip prompt processing on the history (which is what would make time to first token too slow), processing only his actual prompt for the current turn. That turns a quadratic number of tokens to process over the conversation into a linear one. I am not sure if this is what he is doing, but that is what I would do if I had his hardware.
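
Roughly, the saving looks like this (toy sketch, the per-turn token counts are made up, not from his setup):

    # Counts how many prompt tokens get processed over a conversation with and
    # without reusing the KV cache between turns. Illustrative numbers only.
    turns = [120, 200, 90, 300, 150]   # new tokens added each turn (prompt + reply)

    no_cache = 0    # re-process the whole history every turn: quadratic growth
    with_cache = 0  # only process the new tokens each turn: linear growth
    history = 0
    for new_tokens in turns:
        no_cache += history + new_tokens
        with_cache += new_tokens
        history += new_tokens

    print(f"tokens processed without cache reuse: {no_cache}")
    print(f"tokens processed with cache reuse:    {with_cache}")

The with-cache total is just the length of the conversation; the no-cache total grows with the square of it.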

I assume KV caching makes this a non-issue, but I'm also curious.

  • If you're just chatting with it starting with "Hi", that's correct. The conversation remains in the KV cache as it grows gradually.

    But if you're pasting in code, drafts, or even small snippets of articles, it becomes a huge problem.

    • Usually, when people think about the prompt tokens for a chat model, the initial system prompt is the vast majority of the tokens, and it's the same across many usage modes. You might have a slightly different system prompt for code than for English prose or for chatting, but that's only three prompts, which you can permanently keep in some sort of persistent KV cache (see the sketch below). After that, only your specific request in that mode is uncached.
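
      A toy sketch of that idea - tokenize() and the cached state here are stand-ins, not a real KV-cache API; the point is only that each system prompt gets processed once, ever:

          # Persistent prefix cache keyed by system prompt (illustrative only).
          system_prompts = {
              "code": "You are a careful coding assistant. " * 50,   # long, reusable
              "prose": "You are an editor for English prose. " * 50,
              "chat": "You are a friendly chat assistant. " * 50,
          }

          def tokenize(text):
              return text.split()          # stand-in tokenizer

          prefix_cache = {}                # mode -> already-processed system prompt
          processed = 0                    # tokens we actually had to prompt-process

          def ask(mode, request):
              global processed
              if mode not in prefix_cache:                  # cold start for this mode
                  prefix_cache[mode] = tokenize(system_prompts[mode])
                  processed += len(prefix_cache[mode])
              processed += len(tokenize(request))           # only the request is uncached

          for _ in range(100):
              ask("code", "Explain what this snippet does.")

          print(processed)   # one system prompt + 100 short requests, not 100 system prompts

      In practice the cached state would be the model's KV state kept resident (or saved and reloaded) per mode rather than a token list, but the bookkeeping is the same.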