Comment by 0xbadcafebee
5 hours ago
Well, mmap can still cause issues if you run short on RAM, and the resulting disk access can hurt latency and overall performance. It's better than nothing, though.
Agreed that the k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it first. After that, Ollama's default k/v cache type is fp16; you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)
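To put a rough number on the fp16 vs q8_0 difference, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 8192-token context) is a hypothetical example, not something measured, and `kv_cache_bytes` is a helper name made up for illustration. It assumes the usual GGML layout where q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes per 32 elements), and it ignores any runtime overhead or padding.

```python
# Rough k/v cache size estimate. Illustrative only; real allocations
# may differ because of padding, batching, and runtime overhead.

# (bytes per block, elements per block) for each cache type.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
CACHE_TYPES = {
    "f16": (2, 1),
    "q8_0": (34, 32),
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate bytes needed for the K and V caches of one sequence."""
    block_bytes, block_elems = CACHE_TYPES[cache_type]
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * block_bytes // block_elems


# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

for cache_type in ("f16", "q8_0"):
    size = kv_cache_bytes(cache_type=cache_type, **shape)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
# f16: 1.00 GiB
# q8_0: 0.53 GiB
```

So under these assumptions q8_0 cuts the cache to a bit over half of fp16, and the saving grows linearly with context length. As far as I know, the relevant Ollama settings are the `OLLAMA_FLASH_ATTENTION=1` and `OLLAMA_KV_CACHE_TYPE=q8_0` environment variables on the server, but check the linked posts or the Ollama docs for your version.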