Comment by iidsample
19 hours ago
We at UT-Austin have done some academic work on the same challenge, and would be curious whether serving engines could be modified to support it: https://arxiv.org/abs/2412.16434
The core idea is to use user activity at the client to decide when to load and offload the KV cache. Happy to chat more!
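To make the idea concrete, here is a toy sketch. It is not the implementation from the paper or from any serving engine. The class `KVCacheManager`, the event names `"typing"` and `"idle"`, and the slot-based GPU model are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheManager:
    """Toy model: tracks which sessions have their KV cache resident on the GPU.

    Hypothetical, for illustration only. Assumes gpu_slots >= 1.
    """
    gpu_slots: int
    on_gpu: list = field(default_factory=list)  # least recently active first
    on_host: set = field(default_factory=set)   # sessions offloaded to host memory

    def on_user_activity(self, session_id: str, event: str) -> None:
        # Assumed client signals: "typing" hints that a request is coming soon,
        # "idle" hints that the session will not need its cache for a while.
        if event == "typing":
            self._load(session_id)
        elif event == "idle":
            self._offload(session_id)

    def _load(self, session_id: str) -> None:
        if session_id in self.on_gpu:
            self.on_gpu.remove(session_id)   # re-appended below as most recent
        elif len(self.on_gpu) >= self.gpu_slots:
            self._offload(self.on_gpu[0])    # evict the least recently active
        self.on_host.discard(session_id)
        self.on_gpu.append(session_id)

    def _offload(self, session_id: str) -> None:
        if session_id in self.on_gpu:
            self.on_gpu.remove(session_id)
            self.on_host.add(session_id)


manager = KVCacheManager(gpu_slots=2)
for session in ("a", "b", "c"):
    manager.on_user_activity(session, "typing")
print(manager.on_gpu, manager.on_host)
```

In a real engine the load and offload steps would move tensors between GPU and host memory; here they only update bookkeeping so the policy itself is visible.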