Comment by zozbot234

8 hours ago

> RAM use also increases with context window size.

The KV cache is a good candidate for swapping, since it sees only a limited volume of writes per generated token (swapping activations instead would mean writing out something on the order of llm_active_size per token, which is far too much at scale!). So it may be possible to support long contexts with quite acceptable performance while still saving RAM.
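To make the write-volume argument concrete, here is a rough back-of-the-envelope calculation. The model dimensions below are assumptions for a hypothetical Llama-7B-like dense model, not measurements of any specific checkpoint:

```python
# Per-token KV-cache write volume for a hypothetical Llama-7B-like
# model (all dimensions are assumptions for illustration).
n_layers = 32        # transformer blocks
n_kv_heads = 32      # KV heads (no GQA assumed)
head_dim = 128       # per-head dimension
bytes_per_elem = 2   # fp16 cache entries

# Each generated token appends one K vector and one V vector per layer.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(f"KV write per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 512 KiB
```

Half a MiB of appends per token is well within what storage can absorb, which is why paging the KV cache out is plausible while paging activations is not.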

Also make sure you're using mmap to load model parameters, especially for MoE experts. It has no detrimental effect on performance as long as you have enough RAM to begin with, but it lets you scale gradually beyond that at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with the much lower storage_bandwidth).
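The reason mmap loading is cheap up front is that the OS pages data in lazily, only for the regions actually read. A minimal sketch, using a scratch file as a stand-in for a weights file (the "expert" offset is purely illustrative):

```python
import mmap
import os
import tempfile

# Create a 16 MiB scratch file standing in for a model weights file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (16 * 1024 * 1024))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reading one "expert's" slice faults in only those pages;
    # untouched regions cost no RAM until first use.
    offset = 4 * 1024 * 1024
    expert_slice = mm[offset : offset + 4096]
    print(len(expert_slice))  # 4096
    mm.close()

os.remove(path)
```

With MoE models this matters most: experts that are rarely routed to simply never get paged in, so resident memory tracks actual usage rather than file size.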

Well, mmap can still cause issues if you run short of RAM: the resulting disk access adds latency and can hurt overall throughput. It's better than nothing, though.

Agree that the k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Ollama's default k/v cache quantization is also fp16; you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)
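For reference, both settings are environment variables on the Ollama server. A minimal sketch for Linux/macOS, with variable names as described in the linked posts; check that your Ollama version supports them:

```shell
# Flash Attention must be on for quantized k/v cache to take effect.
export OLLAMA_FLASH_ATTENTION=1
# Default is f16; q8_0 roughly halves k/v cache RAM with little quality loss.
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```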