Comment by GaggiX
8 hours ago
The model uses Gated DeltaNet and Gated Attention, so the KV cache takes very little memory, even at BF16 precision.
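A rough back-of-the-envelope sketch of why that would be the case, assuming the usual hybrid layout where only the Gated Attention layers keep a KV cache that grows with context length, while each Gated DeltaNet layer keeps a fixed-size recurrent state. Every number below (layer counts, head counts, head dimensions, context length) is a made-up placeholder, not the actual configuration of the model being discussed.

```python
BF16_BYTES = 2  # bytes per element at BF16 precision


def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, elem_bytes=BF16_BYTES):
    """KV cache size for softmax-attention layers: grows linearly with seq_len."""
    # Factor of 2 because each layer stores both a K and a V tensor.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * elem_bytes


def linear_state_bytes(linear_layers, heads, key_dim, value_dim, elem_bytes=BF16_BYTES):
    """Recurrent state for linear-attention layers: independent of seq_len.

    Assumes one key_dim x value_dim state matrix per head per layer.
    """
    return linear_layers * heads * key_dim * value_dim * elem_bytes


# Hypothetical configuration, for illustration only.
TOTAL_LAYERS = 48
ATTN_LAYERS = 12                      # assumed: 1 in 4 layers uses full attention
LINEAR_LAYERS = TOTAL_LAYERS - ATTN_LAYERS
KV_HEADS, HEAD_DIM = 8, 128
SEQ_LEN = 262_144

all_attention = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
hybrid = (
    kv_cache_bytes(ATTN_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
    + linear_state_bytes(LINEAR_LAYERS, heads=16, key_dim=128, value_dim=128)
)

GIB = 1024 ** 3
print(f"all-attention KV cache: {all_attention / GIB:.2f} GiB")
print(f"hybrid cache + state:   {hybrid / GIB:.2f} GiB")
```

With these placeholder numbers the hybrid layout needs roughly a quarter of the memory, since the fixed-size state of the linear layers is negligible next to a KV cache at long context.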