Comment by vivahir215
2 days ago
Interesting approach. Curious about the latency tradeoff: OLS + SVD are much heavier than Top-K. Have you benchmarked end-to-end inference latency?
From the conclusion:
> The primary trade-off observed is the increased calculation time for OLS and SVD steps. Consequently, the next phase of this work involves implementing these operations within custom Triton kernels to amortize latency. By viewing the cache through the lens of reconstruction fidelity rather than just memory capacity, we can develop more sustainable architectures for long-context inference.
Reading between the lines, the latency increase was significant enough that they didn't want to include it before they had a chance to try to optimize the problem away first.
Still interesting research. Hope they get good results!
Haha, that’s a very fair reading :)
Yeah, the latency hit is definitely real. That said, most of what I’ve run so far is CPU-bound, which likely exaggerates it quite a bit, so I didn’t want to draw strong conclusions from that.
Would need proper GPU implementations to really understand where it lands.
In this prototype, OLS + SVD isn’t per-token; it runs only when the recycle bin fills, so the cost is amortized over multiple tokens.
Even so, it’s still heavier than Top-K. I haven’t benchmarked end-to-end latency yet; this is mainly exploring the accuracy vs. memory tradeoff.
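For anyone curious what "runs only when the recycle bin fills" means in practice, here's a minimal sketch of that amortized trigger. All names, sizes, and the rank choice are mine for illustration, not from the paper, and I'm only showing the SVD low-rank step, not the OLS fit:

```python
import numpy as np

# Hypothetical sketch: evicted KV vectors accumulate in a "recycle bin";
# the expensive SVD is paid only once per BIN_SIZE evictions, so its cost
# is amortized over many tokens instead of incurred per token.
BIN_SIZE, D, RANK = 64, 128, 8  # illustrative values, not the paper's

class RecycleBin:
    def __init__(self):
        self.buf = []          # evicted KV vectors awaiting compression
        self.compressed = []   # list of (U_r, S_r, Vt_r) low-rank factors

    def evict(self, kv_vec):
        self.buf.append(kv_vec)
        if len(self.buf) == BIN_SIZE:   # amortized trigger: bin is full
            self._compress()

    def _compress(self):
        X = np.stack(self.buf)          # shape (BIN_SIZE, D)
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        # keep only a rank-RANK approximation of the evicted block
        self.compressed.append((U[:, :RANK], S[:RANK], Vt[:RANK]))
        self.buf.clear()

    def reconstruct(self, block_idx):
        U, S, Vt = self.compressed[block_idx]
        return (U * S) @ Vt             # approximate original block

rng = np.random.default_rng(0)
bin_ = RecycleBin()
for _ in range(BIN_SIZE):
    bin_.evict(rng.standard_normal(D))
approx = bin_.reconstruct(0)
print(approx.shape)  # (64, 128)
```

The point is just that the per-token work is an O(1) append; the O(n·d·min(n,d)) SVD happens once per block, which is why the latency question really hinges on block size and how well the batched step maps to a GPU kernel.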