Comment by om8
3 hours ago
https://docs.vllm.ai/en/v0.20.0/api/vllm/model_executor/laye...
`vllm.model_executor.layers.quantization.turboquant`
> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
Those works did cite DRIVE/EDEN :)
HIGGS is an extension of EDEN (using the well-known Lloyd-Max method, applied blockwise).
The proper framing of this "TurboQuant" layer in vllm (which does not include JQL) is precisely EDEN (2022) without the scale correction.
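For readers unfamiliar with this family of methods, here is a minimal sketch of the rotation + optimized grid + optional re-normalization pipeline described above. This is illustrative only, not vLLM's actual implementation: the function names are made up, and the grid values are approximate Lloyd-Max centroids for a standard normal (as used in 2-bit scalar quantization), not the exact grid any of these papers uses.

```python
import numpy as np

def random_rotation(d, seed=0):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for uniform distribution

def quantize_to_grid(x, grid):
    # Nearest-neighbor scalar quantization onto a fixed grid
    idx = np.argmin(np.abs(x[:, None] - grid[None, :]), axis=1)
    return grid[idx]

def rotate_quantize(x, grid, renormalize=True, seed=0):
    # 1. rotate: coordinates of the rotated vector are ~Gaussian
    R = random_rotation(x.shape[0], seed)
    y = R @ x
    # 2. scalar-quantize each coordinate on the (Gaussian-optimized) grid
    q = quantize_to_grid(y, grid)
    # 3. optional re-normalization: rescale so the norm is preserved
    if renormalize and np.linalg.norm(q) > 0:
        q = q * (np.linalg.norm(y) / np.linalg.norm(q))
    # 4. undo the rotation to reconstruct the vector
    return R.T @ q

# Approximate 4-level Lloyd-Max grid for N(0, 1) (2 bits per coordinate)
GRID_2BIT = np.array([-1.510, -0.4528, 0.4528, 1.510])
```

With re-normalization on, the reconstruction has exactly the input's Euclidean norm (rotations are norm-preserving), which is the "scale correction" flavor of EDEN alluded to above.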