shdudns 1 day ago: Sorry, I'm a bit of a noob on LLMs. What is "prefill"? As opposed to what?
natechlin 1 day ago:
Prefill - the model computes the KV cache over the input tokens, up to and including the last token of your input (the 'prompt'), at which point it can begin:
Decode - the model chooses a new token to append to the end of the current token list (i.e. it generates a token), then computes that new token's KVs.
Decode is basically: prefill 1 token -> add 1 token -> prefill 1 more token -> ...
But the initial prefill stage doesn't need to do any generation, since you've provided the tokens.
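The two phases above can be sketched in toy Python. Everything here is hypothetical scaffolding: `compute_kv`, the hash-based "projections", and the dummy sampling rule are stand-ins, not any real model's or library's API. The point is just the shape of the loop: prefill covers all prompt tokens in one pass, then decode adds one token (and one KV entry) per step.

```python
# Toy sketch of prefill vs. decode. Attention itself is faked: the KV
# cache is just a list of per-token (key, value) placeholders.

def compute_kv(token):
    # Stand-in for the per-token key/value projections.
    return (hash(token) % 97, hash(token) % 89)

def prefill(prompt_tokens):
    # Prefill: compute KVs for every prompt token in one pass; no generation.
    return [compute_kv(t) for t in prompt_tokens]

def decode_step(kv_cache, tokens):
    # Decode: pick the next token (a dummy rule standing in for sampling
    # from the logits), then compute and cache that new token's KVs.
    next_tok = f"gen{len(tokens)}"
    tokens.append(next_tok)
    kv_cache.append(compute_kv(next_tok))
    return next_tok

prompt = ["Hello", ",", "world"]
tokens = list(prompt)
cache = prefill(tokens)          # one pass over all the provided tokens
for _ in range(3):               # decode: one token, one new KV, per step
    decode_step(cache, tokens)

assert len(cache) == len(tokens)  # the cache always covers every token so far
```

The invariant at the end is the useful bit: after prefill, and after every decode step, the cache has exactly one entry per token in the sequence.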
ghm2199 1 day ago:
Incidentally, prefill is also how caching, say, a system prompt saves you some $ on API usage with LLM providers: they only compute the KV cache for the new tokens that come after the system prompt.
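A minimal sketch of that prefix-caching idea, again with hypothetical names (`compute_kv`, `prefill_with_cache`) rather than any provider's actual API: match the longest shared token prefix against the cached KVs, and only compute (and, in the billing analogy, pay for) the suffix.

```python
# Toy sketch of system-prompt (prefix) KV caching; hypothetical, not a real API.

def compute_kv(token):
    return hash(token) % 97  # stand-in for the key/value projections

def prefill_with_cache(tokens, cache):
    # Reuse cached (token, kv) pairs for the shared prefix;
    # compute KVs only for the new suffix.
    n = 0
    while n < len(cache) and n < len(tokens) and cache[n][0] == tokens[n]:
        n += 1
    new = [(t, compute_kv(t)) for t in tokens[n:]]
    # Return the full cache plus how many tokens were actually computed.
    return cache[:n] + new, len(tokens) - n

system = ["You", "are", "helpful"]

# Cold cache: every token's KVs must be computed.
cache, computed = prefill_with_cache(system + ["Hi"], [])
assert computed == 4

# Warm cache: the system prompt's KVs are reused; only the new token is computed.
cache2, computed2 = prefill_with_cache(system + ["Bye"], cache)
assert computed2 == 1
```

This only works for a shared *prefix*: once the token sequences diverge, every position after the divergence point needs fresh KVs, which is why providers cache the system prompt rather than arbitrary mid-conversation text.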