Comment by jchandra

2 days ago

I’ve been exploring KV cache optimization for LLM inference.

Most methods (Top-K, sliding window) prune tokens from the cache. This works on average, but fails selectively: a small number of tokens cause large errors when removed.
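For context, the simplest form of Top-K pruning looks roughly like this (a minimal NumPy sketch for one head; function name and shapes are illustrative, not any library's API):

```python
import numpy as np

def topk_prune(K, V, q, k=64):
    """Keep only the k cached tokens with the highest raw attention
    logits for the current query q (illustrative sketch)."""
    scores = K @ q / np.sqrt(q.shape[-1])  # (T,) attention logits
    keep = np.argsort(scores)[-k:]         # indices of the top-k tokens
    return K[keep], V[keep]

# Toy usage: 128 cached tokens, head dim 16, keep 32.
rng = np.random.default_rng(0)
K = rng.standard_normal((128, 16))
V = rng.standard_normal((128, 16))
q = rng.standard_normal(16)
Kp, Vp = topk_prune(K, V, q, k=32)
```

The failure mode above is exactly this hard cutoff: a token just below the threshold contributes nothing, no matter how much the softmax would have weighted it.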

I tried reframing the problem: instead of deciding which tokens to drop, approximate the attention function itself, Attn(Q, K, V).
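By Attn I mean standard scaled dot-product attention; a minimal single-head reference implementation for clarity:

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Sanity check: identical (zero) keys give uniform weights,
# so the output is the mean of the value rows.
Q = np.ones((1, 4))
K = np.zeros((3, 4))
V = np.arange(12, dtype=float).reshape(3, 4)
out = attn(Q, K, V)
```

The goal is then to keep the output of this function close under compression, rather than to keep individual tokens.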

Prototype pipeline:

- entropy → identify weak tokens
- OLS → reconstruct their contribution
- SVD → compress them
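Roughly, the three steps fit together like this (a simplified sketch, not the actual prototype: the entropy criterion, the OLS target, and all names, shapes, and thresholds here are illustrative choices):

```python
import numpy as np

def compress_cache(K, V, recent_attn, keep_frac=0.5, rank=8):
    """Sketch of the entropy -> OLS -> SVD pipeline.
    recent_attn: (Q, T) attention weights from recent queries
    over the T cached tokens. All specifics are illustrative."""
    _, T = recent_attn.shape

    # 1) entropy: tokens whose received attention is near-uniform
    #    across queries carry little distinguishing signal; treat
    #    the highest-entropy tokens as "weak" (criterion simplified).
    p = recent_attn / recent_attn.sum(axis=0, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)   # (T,)
    n_keep = int(keep_frac * T)
    order = np.argsort(ent)                      # low entropy first
    strong, weak = order[:n_keep], order[n_keep:]

    # 2) OLS: express the weak values as a least-squares combination
    #    of the strong ones, so their contribution can be
    #    reconstructed from the retained cache.
    A, *_ = np.linalg.lstsq(V[strong].T, V[weak].T, rcond=None)

    # 3) SVD: store a rank-r factorization of the weak keys instead
    #    of the full matrix.
    U, s, Vt = np.linalg.svd(K[weak], full_matrices=False)
    K_weak_lowrank = (U[:, :rank] * s[:rank], Vt[:rank])

    return strong, A, K_weak_lowrank

# Toy usage: 64 cached tokens, head dim 16, 8 recent queries.
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
attnw = rng.random((8, 64))
strong, A, (Kw_U, Kw_Vt) = compress_cache(K, V, attnw, rank=4)
```

The point of the sketch is only the division of labor: entropy decides *which* tokens are safe to touch, OLS preserves *what* they contributed, and SVD decides *how much* memory the leftovers cost.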

Early results show lower approximation error than Top-K at low memory budgets, and sometimes lower memory use overall.

This is still a small research prototype; I'd appreciate feedback or pointers to related work.