Comment by NitpickLawyer

6 hours ago

It really depends. The new "thinking" models usually spend some time reasoning before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add prompt processing and the slowdown as context grows, and it becomes really slow for longer sessions.
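
To put rough numbers on that wait, here is a minimal sketch. The function name, parameters, and the speeds plugged in are illustrative assumptions, not measurements; "1k tokens in a minute" corresponds to roughly 17 t/s of decode speed.

```python
def wait_seconds(thinking_tokens, decode_tps, prompt_tokens=0, prefill_tps=None):
    """Estimate seconds spent waiting before the final answer starts.

    All inputs are assumed values: decode_tps is generation speed in
    tokens/s, prefill_tps is prompt-processing speed in tokens/s.
    """
    # Prompt processing only counts if a prefill speed is supplied.
    prefill = prompt_tokens / prefill_tps if prefill_tps else 0.0
    # The hidden "thinking" tokens are generated at decode speed.
    return prefill + thinking_tokens / decode_tps


# 1k thinking tokens at an assumed 20 t/s, no prompt cost
print(wait_seconds(1000, 20))
# Same, plus an assumed 4k-token prompt processed at 200 t/s
print(wait_seconds(1000, 20, prompt_tokens=4000, prefill_tps=200))
```

Decode speed usually drops as context grows, so later turns in a long session would be worse than this constant-speed estimate suggests.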

2 comments

mudkipdev 5 hours ago

Reminds me of the possibility of running DeepSeek at 3-4 t/s with SSD streaming. That could be viable if you are running something overnight, for example.
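
As a back-of-the-envelope check on the overnight idea, a small sketch. The 8-hour window is an assumption, and the helper name is made up for illustration.

```python
def tokens_overnight(tokens_per_second, hours=8):
    """Total tokens generated at a constant speed over an assumed run length."""
    return int(tokens_per_second * hours * 3600)


for tps in (3, 4):
    print(f"{tps} t/s for 8 h: {tokens_overnight(tps)} tokens")
```

This counts every generated token, including any "thinking" tokens, so the amount of final-answer text would be smaller.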

zozbot234 3 hours ago

The nice thing about DeepSeek and off-memory streaming is that you ought to be able to batch multiple sessions of it in parallel. Each individual session would slow down from streaming incrementally more active weights from disk, but your total tok/s would ultimately only be limited by compute. Other models have trouble doing this, because the KV cache takes too much space in RAM (and increases wear-and-tear if stored on disk) even for somewhat limited context.
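
A toy model of that batching argument, under loud assumptions: each decode step streams the needed weights from disk once and shares them across every session in the batch, and the step takes whichever is longer, the disk read or the compute. The numbers are made up, and the sketch ignores the extra weights that additional sessions would pull in.

```python
def total_tokens_per_second(batch, disk_seconds_per_step, compute_seconds_per_token):
    """Aggregate throughput for `batch` parallel sessions (toy model).

    Assumes one shared disk read per step and compute cost that grows
    linearly with the number of sessions.
    """
    step_time = max(disk_seconds_per_step, batch * compute_seconds_per_token)
    # Each step produces one token per session.
    return batch / step_time


# Assumed: 0.3 s of streaming per step, 0.01 s of compute per token
for batch in (1, 10, 30, 60):
    print(batch, round(total_tokens_per_second(batch, 0.3, 0.01), 2))
```

In this model throughput grows with batch size until compute time per step matches the disk time, then flattens at `1 / compute_seconds_per_token`, which is the "only limited by compute" regime.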