whimsicalism 4 days ago
Makes sense, I think streaming audio->audio inference is a relatively big lift.
red2awn 4 days ago
Correct, it breaks the single prompt, single completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV cache prefill with a websocket server.
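For concreteness, here is a minimal sketch of what that streaming prefill could look like, assuming a Hugging Face-style causal LM stands in for the audio model and a `websockets` server feeds it chunk by chunk. `encode_audio_chunk` is a hypothetical stub for a real audio tokenizer/codec, not an actual API:

```python
# Sketch: prefill the KV cache incrementally as audio chunks arrive over a
# websocket, so the prompt is already processed when the user stops talking.
# Assumes websockets >= 11 (single-argument handler) and transformers.
import asyncio
import torch
import websockets  # pip install websockets
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in model

def encode_audio_chunk(chunk) -> torch.Tensor:
    # Hypothetical stand-in: a real system would run an audio codec here.
    n = max(1, len(chunk) // 320)  # pretend ~320 bytes map to one token
    return torch.randint(0, model.config.vocab_size, (1, n))

async def handle(ws):
    past = None  # the KV cache, grown chunk by chunk as audio streams in
    async for chunk in ws:  # each websocket message is one audio frame
        ids = encode_audio_chunk(chunk)
        with torch.no_grad():
            # Prefill only the new tokens; earlier ones live in the cache.
            out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
    # By end-of-utterance the prompt is fully prefilled, so time-to-first-token
    # for the reply is just a single decode step.

async def main():
    async with websockets.serve(handle, "localhost", 8765):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```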
whimsicalism 3 days ago
I imagine you have to start decoding many speculative completions in parallel to have true low latency.
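A hedged sketch of that idea: while the user is still speaking, batch-decode replies for several hypothesized end-of-turn points, then keep whichever candidate matches when the turn actually ends. The model, prompts, and helper below are illustrative assumptions, not any real framework's method:

```python
# Sketch: decode several speculative completions in parallel as one batch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad so generation starts at a common index
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in model

def speculate(prefix: str, hypothesized_endings: list[str], n_new: int = 20):
    # One batched generate call: each row is the prefix plus one guess at how
    # the user's turn ends, so the candidate replies decode in parallel.
    prompts = [prefix + e for e in hypothesized_endings]
    batch = tok(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=n_new, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    new_tokens_start = batch["input_ids"].shape[1]
    return {e: tok.decode(o[new_tokens_start:], skip_special_tokens=True)
            for e, o in zip(hypothesized_endings, out)}

# When the turn actually ends, discard all but the matching candidate; its
# tokens are already decoded, so the perceived latency is near zero.
candidates = speculate("User: what's the weather", [" today?", " tomorrow?"])
```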