Comment by GaggiX
10 hours ago
The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
10 hours ago
The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
No comments yet
Contribute on Hacker News ↗