Comment by Aurornis 8 hours ago Unfortunately not with a reasonable context length.
regularfry 1 hour ago I've got 139k context with the UD-Q4_K_XL on a 4090, q8_0 ctk/v. Could probably squeeze a little more but that's enough for me for the moment.
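For reference, a minimal sketch of that kind of setup via llama-cpp-python. The GGUF path is a placeholder, the context value just mirrors the figure above, and the flash_attn/type_k/type_v parameters assume a recent llama-cpp-python build that exposes quantized KV-cache options; this is an illustration, not regularfry's actual command.

```python
# Sketch of a long-context setup with a q8_0 K/V cache, via llama-cpp-python.
# Assumes a recent build that exposes flash_attn, type_k, and type_v.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model-UD-Q4_K_XL.gguf",   # placeholder path to the UD-Q4_K_XL GGUF
    n_gpu_layers=-1,                      # -1 = offload all layers; adjust to what fits in 24 GB
    n_ctx=139_264,                        # ~139k tokens of context, as reported above
    flash_attn=True,                      # llama.cpp needs flash attention to quantize the V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,      # q8_0 K cache ("ctk")
    type_v=llama_cpp.GGML_TYPE_Q8_0,      # q8_0 V cache ("ctv")
)

print(llm.create_completion("Hello", max_tokens=16)["choices"][0]["text"])
```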
kkzz99 7 hours ago It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090.
GaggiX 6 hours ago The model uses Gated DeltaNet and Gated Attention, so the memory usage of the KV cache is very low, even at BF16 precision.
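For a rough sense of why, here is a back-of-the-envelope sketch. The layer split, KV head count, and head dimension below are illustrative assumptions about a hybrid stack (most layers linear-attention, a minority full attention), not the model's published config.

```python
# Back-of-the-envelope KV-cache estimate for a hybrid DeltaNet/full-attention stack.
# All architecture numbers are illustrative assumptions, not the published config.
n_layers        = 48        # total blocks (assumed)
full_attn_every = 4         # assume 1 in 4 blocks uses full (gated) attention
n_kv_heads      = 2         # assumed GQA KV heads per full-attention layer
head_dim        = 256       # assumed head dimension
ctx             = 139_000   # tokens of context
bytes_per_elem  = 2         # BF16; roughly 1 for a q8_0 cache

full_attn_layers = n_layers // full_attn_every
# K and V each store n_kv_heads * head_dim values per token, per full-attention layer.
kv_bytes = 2 * full_attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
print(f"KV cache: {kv_bytes / 1024**3:.1f} GiB for {ctx:,} tokens")
# The Gated DeltaNet layers keep a fixed-size recurrent state instead of a
# per-token cache, so they add essentially nothing as the context grows.
```

Under those assumptions the full-precision cache works out to only a few GiB even at ~139k tokens, which is why quantizing it (or not) barely moves the VRAM needle on a 24 GB card.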