Comment by whimsicalism 2 months ago

Makes sense, I think streaming audio->audio inference is a relatively big lift.
red2awn 2 months ago

Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV cache prefill with a websocket server.
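To make the streaming-prefill point concrete, here is a minimal sketch, assuming a recent version of the `websockets` library and a HuggingFace-style causal LM as a stand-in; `encode_audio_chunk` is a hypothetical placeholder for whatever maps incoming audio frames to model tokens (a real audio->audio system would emit neural codec tokens):

    # Sketch: stream audio chunks over a websocket, prefilling the KV cache
    # as they arrive so decoding can start the moment the user stops talking.
    import asyncio
    import torch
    import websockets
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in model

    def encode_audio_chunk(chunk: bytes) -> torch.Tensor:
        # Hypothetical placeholder: fake a few token ids per audio frame.
        return torch.randint(0, model.config.vocab_size, (1, 8))

    async def handle(ws):
        past = None  # KV cache carried across chunks
        async for chunk in ws:
            ids = encode_audio_chunk(chunk)
            with torch.no_grad():
                # Incremental prefill: only the new tokens run through the
                # model; earlier context already lives in past_key_values.
                out = model(input_ids=ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
        # On end of turn, decode from the already-warm cache, so the
        # time-to-first-token excludes the whole-prompt prefill cost.

    async def main():
        async with websockets.serve(handle, "localhost", 8765):
            await asyncio.Future()  # serve forever

    asyncio.run(main())

The point of the design is that prefill cost is amortized over the conversation: by the time the user finishes speaking, the cache is already warm and the model only has to decode.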
whimsicalism 2 months ago

I imagine you have to start decoding many speculative completions in parallel to have true low latency.
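One way to read that idea: fork the warm prefix into a batch and decode several candidate completions at once, later keeping whichever row the real turn boundary validates. A hedged sketch with the same stand-in model (the batching is the point, not the sampling details):

    # Sketch of the parallel-speculation idea: one batch row per speculative
    # completion (e.g. one per plausible end-of-turn point), decoded together.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in model

    def speculate(prefix_ids: torch.Tensor, n_candidates: int = 4, steps: int = 16):
        ids = prefix_ids.repeat(n_candidates, 1)  # batch dim carries the speculation
        past = None
        for _ in range(steps):
            with torch.no_grad():
                # Full prefix on the first step; afterwards only the newest token,
                # with earlier context held in the per-row KV cache.
                out = model(input_ids=ids if past is None else ids[:, -1:],
                            past_key_values=past, use_cache=True)
            past = out.past_key_values
            # Sample one next token per row so the candidates diverge.
            probs = torch.softmax(out.logits[:, -1, :], dim=-1)
            ids = torch.cat([ids, torch.multinomial(probs, num_samples=1)], dim=-1)
        return ids  # keep the row the conversation confirms; discard the rest

    candidates = speculate(torch.randint(0, 50257, (1, 8)))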