Comment by jchandra
2 days ago
I’ve been exploring KV cache optimization for LLM inference.
Most methods (Top-K, sliding window) prune tokens outright. That works on average but fails selectively: a handful of tokens cause large errors when removed.
I tried reframing the problem as approximating the attention function: Attn(Q, K, V)
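For concreteness, the function being approximated is standard scaled dot-product attention, softmax(QKᵀ/√d)V. A minimal NumPy reference (single head, no masking, just for measuring approximation error against):

```python
import numpy as np

def attn(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows sum to 1
    return w @ V
```

Pruning the cache amounts to deleting rows of K and V; the error of any scheme can be measured as the difference between `attn(Q, K, V)` and attention over the reduced cache.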
Prototype:
- entropy → identify weak tokens
- OLS → reconstruct their contribution
- SVD → compress them
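The comment doesn't spell out the exact scoring or fitting rules, so the following is only an illustrative sketch of the three-step pipeline: an entropy-based proxy to pick "weak" tokens, ordinary least squares to predict their value rows from the kept tokens, and a truncated SVD on the residual. The entropy proxy and the keep/drop rule here are my assumptions, not the author's method.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_weak_tokens(K, V, Q_sample, keep_frac=0.5, rank=2):
    # 1) entropy: score each cached token by how flat the attention it
    #    receives is across a sample of queries (illustrative proxy only).
    d = K.shape[-1]
    W = softmax(Q_sample @ K.T / np.sqrt(d))           # (n_queries, n_tokens)
    col = W / (W.sum(axis=0, keepdims=True) + 1e-12)   # per-token distribution
    entropy = -(col * np.log(col + 1e-12)).sum(axis=0)
    n_keep = max(1, int(keep_frac * K.shape[0]))
    order = np.argsort(entropy)                        # low entropy = kept
    keep, weak = np.sort(order[:n_keep]), np.sort(order[n_keep:])

    # 2) OLS: predict each weak token's value row as a linear
    #    combination of the kept tokens' value rows.
    C_T, *_ = np.linalg.lstsq(V[keep].T, V[weak].T, rcond=None)
    V_weak_hat = C_T.T @ V[keep]

    # 3) SVD: store the residual at reduced rank.
    R = V[weak] - V_weak_hat
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    r = min(rank, len(s))
    residual_lowrank = (U[:, :r] * s[:r]) @ Vt[:r]

    # approximation of the dropped value rows, plus which indices were dropped
    return keep, weak, V_weak_hat + residual_lowrank
```

The OLS coefficients plus the rank-r SVD factors are what would actually be stored; at low rank this is far smaller than the dropped rows themselves, which is where the memory saving over plain Top-K would come from.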
Early results show lower error than Top-K at matched memory budgets, and sometimes lower total memory as well.
This is still a small research prototype; I'd appreciate feedback or pointers to related work.