Comment by nodja

6 days ago

Yeah, ChatGPT pretty much nailed it.

But you still have to load the data for each request. And in an LLM, doesn't this mean the WHOLE KV cache, because the KV cache changes after every computation? So why isn't THIS the bottleneck? Gemini is talking about a context window of a million tokens. How big would the KV cache for this get?
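
For a rough sense of scale on the size question, here is a back-of-the-envelope sketch in Python. Gemini's architecture isn't public, so the dimensions below (80 layers, 8 KV heads, head size 128, fp16) are purely hypothetical placeholders, and `kv_cache_bytes` is a made-up helper name. It only assumes standard attention, where every layer stores one key vector and one value vector per KV head per token.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes for ONE sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model shape; NOT Gemini's real configuration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, N_LAYERS, N_KV_HEADS, HEAD_DIM)
total = kv_cache_bytes(1_000_000, N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 1e9:.1f} GB for a 1M-token context")
```

With those assumed numbers it comes out to about 320 KiB per token, so roughly 328 GB for a single million-token sequence. A different layer count, fewer KV heads, or a quantized cache would change that figure a lot.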