Comment by in-silico
12 hours ago
Modern KV caches can contain up to 1 million tokens (~3,000 pages of text). That's not short; it's something like 48 straight hours of reading.
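A rough sanity check of those figures, with my own assumed numbers (~330 tokens per page, ~0.75 words per token, ~250 words per minute of reading; none of these are measured, they're just plausible guesses):

    # Back-of-the-envelope check of "1M tokens ~ 3000 pages ~ 48 hours".
    # All constants below are assumptions, not measured values.
    tokens = 1_000_000
    tokens_per_page = 330        # assumed dense page of prose
    words_per_token = 0.75       # common rough conversion for English text
    words_per_minute = 250       # assumed adult reading speed

    pages = tokens / tokens_per_page                          # ~3030 pages
    hours = tokens * words_per_token / words_per_minute / 60  # ~50 hours
    print(f"{pages:.0f} pages, {hours:.0f} hours of reading")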
Yes and no. It's not just text; it's images, video, etc., and it's not just the pages of content, it's also all the "thinking" as well. Plus the models tend to work better earlier in the context.
I regularly get close to filling up context windows and have to compact the context. I can do this several times in a single session of working on a problem, which you could argue is roughly my own context window.
My point, though, was that almost none of the model's knowledge is in the context; it's all in the training. We have no functional long-term memory for LLMs beyond training.
The KV cache isn't memory; it's the saved intermediate state of the run, kept so inference can resume from the point where the last generated output is concatenated with the next input instead of recomputing the whole prefix. It's entirely about saving compute and has nothing to do with memory.
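A minimal sketch of what actually gets saved, using a toy single-head attention in NumPy (the dimensions, weights, and function names are made up for illustration, not any particular framework's API): the keys and values for every token already processed are stored, so each new step only projects the one new token and attends over the cache.

    import numpy as np

    d = 8                                  # toy head dimension
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    K_cache = np.empty((0, d))             # the "KV cache": keys/values of the prefix
    V_cache = np.empty((0, d))

    def step(x):
        # Process one new token embedding x, reusing the cached prefix.
        global K_cache, V_cache
        K_cache = np.vstack([K_cache, x @ Wk])   # only the new token is projected
        V_cache = np.vstack([V_cache, x @ Wv])
        q = x @ Wq
        scores = K_cache @ q / np.sqrt(d)        # attend over everything cached
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V_cache

    for token_embedding in rng.standard_normal((5, d)):
        out = step(token_embedding)        # per-step work stays small; no prefix recompute

Nothing in those two arrays is "remembered" in any semantic sense; throw them away and the model can recompute them from the same tokens, it just costs a full pass over the prefix again.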
This really obscures how stupid LLMs are: they're just text logs as input and text logs as output; hence the goblins are just tokens that, problematically, happen to be more probable in the output.
But the KV cache is a thing made to keep a session from having to re-run the entire inference. The only thing you could call "memory" is that there are no random perturbations in the KV cache, whereas there may be when re-running the chat, which ends up being non-deterministic. You can think of it as a deterministic seed that prevents a conversation from drifting into its normal non-deterministic output.
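To illustrate just the determinism part with a toy contrast (made-up numbers, pure NumPy; it doesn't capture where real re-run non-determinism comes from, only the contrast): re-sampling a chat can diverge from run to run, while a stored prefix/cache is fixed by construction.

    import numpy as np

    probs = np.array([0.55, 0.40, 0.05])     # toy next-token distribution

    # Re-running the chat samples fresh each time, so runs can diverge.
    reruns = [np.random.default_rng().choice(3, p=probs, size=4).tolist()
              for _ in range(3)]
    print(reruns)                             # likely three different sequences

    # A saved KV cache is just stored tensors: resuming from it always starts
    # from exactly the same state, so only the newly generated tokens can vary.
    cached_prefix_state = np.random.default_rng(0).standard_normal((4, 8))
    resumed_state = cached_prefix_state       # loaded as-is, not recomputed or resampled
    print(np.array_equal(resumed_state, cached_prefix_state))   # True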