Comment by vivahir215
2 days ago
Interesting approach. Curious about the latency tradeoff: OLS + SVD are much heavier than Top-K. Have you benchmarked end-to-end inference latency?
From the conclusion:
> The primary trade-off observed is the increased calculation time for OLS and SVD steps. Consequently, the next phase of this work involves implementing these operations within custom Triton kernels to amortize latency. By viewing the cache through the lens of reconstruction fidelity rather than just memory capacity, we can develop more sustainable architectures for long-context inference.
Reading between the lines, the latency increase was significant enough that they didn't want to include it before they had a chance to try to optimize the problem away first.
Still interesting research. Hope they get good results!
Haha, that’s a very fair reading :)
Yeah, the latency hit is definitely real. That said, most of what I’ve run so far is CPU-bound, which likely exaggerates it quite a bit, so I didn’t want to draw strong conclusions from that.
Would need proper GPU implementations to really understand where it lands.
In this prototype, OLS + SVD isn’t per-token; it runs only when the recycle bin fills, so the cost is amortized over multiple tokens.
Even so, it’s still heavier than Top-K. I haven’t benchmarked end-to-end latency yet; this is mainly exploring the accuracy vs. memory tradeoff.
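For anyone curious what "runs only when the recycle bin fills" means in practice, here's a minimal sketch of that amortized trigger. All names, sizes, and the rank choice are mine for illustration, not from the paper, and I'm only showing the SVD low-rank step, not the OLS fit:

```python
import numpy as np

# Hypothetical sketch: evicted KV vectors accumulate in a "recycle bin";
# the expensive SVD is paid only once per BIN_SIZE evictions, so its cost
# is amortized over many tokens instead of incurred per token.
BIN_SIZE, D, RANK = 64, 128, 8  # illustrative values, not the paper's

class RecycleBin:
    def __init__(self):
        self.buf = []          # evicted KV vectors awaiting compression
        self.compressed = []   # list of (U_r, S_r, Vt_r) low-rank factors

    def evict(self, kv_vec):
        self.buf.append(kv_vec)
        if len(self.buf) == BIN_SIZE:   # amortized trigger: bin is full
            self._compress()

    def _compress(self):
        X = np.stack(self.buf)          # shape (BIN_SIZE, D)
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        # keep only a rank-RANK approximation of the evicted block
        self.compressed.append((U[:, :RANK], S[:RANK], Vt[:RANK]))
        self.buf.clear()

    def reconstruct(self, block_idx):
        U, S, Vt = self.compressed[block_idx]
        return (U * S) @ Vt             # approximate original block

rng = np.random.default_rng(0)
bin_ = RecycleBin()
for _ in range(BIN_SIZE):
    bin_.evict(rng.standard_normal(D))
approx = bin_.reconstruct(0)
print(approx.shape)  # (64, 128)
```

The point is just that the per-token work is an O(1) append; the O(n·d·min(n,d)) SVD happens once per block, which is why the latency question really hinges on block size and how well the batched step maps to a GPU kernel.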