Comment by omneity
5 hours ago
You can extend the context window beyond the model's maximum trained context with RoPE scaling[0], though the longer context will require more VRAM.
But you can also fit a longer context window in the same VRAM by quantizing the KV cache: FP8 roughly doubles the context relative to a 16-bit cache, and TurboQuant more than doubles it[1].
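A rough back-of-the-envelope sketch of why a smaller KV cache element buys more context. The model shape (32 layers, 8 KV heads, head dimension 128) and the 8 GiB cache budget are made-up illustrative numbers, not any particular model, and the arithmetic ignores quantization scale factors and other runtime overhead:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache needed for one token: keys and values in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(budget_bytes, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Number of tokens whose KV cache fits in the given memory budget."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return budget_bytes // per_token


# Hypothetical model shape and cache budget (assumptions for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 8 * 1024**3  # 8 GiB reserved for the KV cache

fp16_tokens = max_context_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_tokens = max_context_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(f"FP16 KV cache: {fp16_tokens} tokens")
print(f"FP8 KV cache:  {fp8_tokens} tokens")
```

With these numbers the FP16 cache holds 65536 tokens and the FP8 cache holds 131072, i.e. halving the bytes per element doubles the tokens that fit in the same budget.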
0: https://medium.com/@leannetan/extending-context-length-with-...
1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...