Comment by scrlk
12 hours ago
IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.
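For a rough sense of what "pretty much lossless" means numerically, here is a minimal sketch of rounding a single value to FP8. It assumes the E4M3 variant (4 exponent bits, 3 mantissa bits, maximum 448) with saturation and no per-tensor scaling; the thread does not say which FP8 format or scaling scheme is in use, and the function name `round_to_fp8_e4m3` is made up for illustration.

```python
import math


def round_to_fp8_e4m3(x: float) -> float:
    """Round x to the nearest value representable in FP8 E4M3 (assumed format)."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), 448.0)          # saturate at the largest finite E4M3 value
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)       # floor(log2(a)), clamped to the subnormal range
    step = 2.0 ** (exponent - 3)    # 3 mantissa bits -> 8 steps per power of two
    return sign * round(a / step) * step   # round() breaks ties to even


if __name__ == "__main__":
    for value in (0.3, 1.06, 3.14159, 1000.0):
        q = round_to_fp8_e4m3(value)
        rel_err = abs(q - value) / abs(value)
        print(f"{value:>10} -> {q:>8}  relative error {rel_err:.3%}")
```

With 3 mantissa bits, the per-value relative rounding error in the normal range is at most 2^-4 (6.25%). Whether that is harmless for weights but harmful for a KV cache is an empirical question this sketch does not answer.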
The official API is FP8, which should imply that it's lossless.
>unquantised -> FP8 is pretty much lossless
Claude Shannon is rolling in his grave.
I don't know, that sounds quite close to his rate-distortion theorem, which gives the minimum number of bits per symbol you need to stay under some fixed amount of distortion. In other words, lossy compression with a bounded amount of loss, i.e. "pretty much lossless" compression.
https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory
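As a textbook illustration of the rate-distortion idea (not a model of FP8 or of neural network weights), the rate-distortion function and its closed form for a Gaussian source with variance $\sigma^2$ under squared-error distortion are:

```latex
% General definition: minimum rate (bits/symbol) at expected distortion <= D
R(D) = \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})

% Gaussian source, squared-error distortion
R(D) =
\begin{cases}
  \tfrac{1}{2}\log_2\!\left(\dfrac{\sigma^2}{D}\right), & 0 < D \le \sigma^2 \\[6pt]
  0, & D > \sigma^2
\end{cases}
\qquad\Longleftrightarrow\qquad
D(R) = \sigma^2\, 2^{-2R}
```

In this idealized case, each extra bit per symbol cuts the achievable mean squared error by a factor of 4, so a finite bit budget always leaves some distortion; the question is only whether it stays below what matters.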
Do infra providers reveal that level of implementation detail?
I've seen a few articles from providers talking about KV cache quantisation, but it's not something they explicitly point out like they do with weights.
So you could end up paying more for unquantised weights, only to get silently hit with a quantised KV cache...