Comment by zozbot234
4 hours ago
That's pretty nice, actually. How much KV cache does that model require at full context? That tends to be the main limit on running concurrent requests locally. There's KV quantization, but it has an outsized negative impact on model quality.
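For a rough sense of scale, here is a back-of-the-envelope sketch of the usual KV cache estimate. It assumes plain multi-head or grouped-query attention with a full-length cache in every layer, so it won't match models that use sliding-window, latent, or hybrid attention. The function name and the example config (32 layers, 8 KV heads, head dimension 128, 32k context) are hypothetical, not the numbers for the model being discussed.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, n_sequences=1):
    """Estimate KV cache size in bytes for standard (MHA/GQA) attention.

    Assumes every layer stores one K and one V tensor of shape
    [context_len, n_kv_heads, head_dim] per sequence.
    """
    k_and_v = 2  # one key tensor plus one value tensor per layer
    return (k_and_v * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * n_sequences)


GIB = 1024 ** 3

# Hypothetical config, chosen only to illustrate the arithmetic.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)

fp16_one = kv_cache_bytes(**cfg, bytes_per_elem=2)                 # 16-bit cache
fp16_four = kv_cache_bytes(**cfg, bytes_per_elem=2, n_sequences=4)  # 4 concurrent requests
q8_one = kv_cache_bytes(**cfg, bytes_per_elem=1)                   # ~8-bit cache, ignoring scale overhead

print(f"fp16, 1 sequence:  {fp16_one / GIB:.1f} GiB")
print(f"fp16, 4 sequences: {fp16_four / GIB:.1f} GiB")
print(f"8-bit, 1 sequence: {q8_one / GIB:.1f} GiB")
```

The cache grows linearly with both context length and the number of concurrent sequences, which is why it becomes the limit for concurrency before the weights do. Halving the bytes per element halves the cache, which is the appeal of KV quantization despite the quality cost.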