Comment by tripplyons
6 days ago
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specify which kind of sparsity it meant.
For weight sparsity, I know the BitNet b1.58 paper claims improved performance from restricting weights to -1, 0, or 1, which eliminates the need to multiply by the weights and lets weights with a value of 0 be skipped entirely.
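In case it helps, here's a rough sketch of the absmean ternarization step described in that paper (the function name and toy matrix are my own for illustration, not the authors' code):

```python
import numpy as np

def ternarize_weights(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Quantize a weight matrix to {-1, 0, +1}: scale by the mean
    absolute weight, then round and clip to the ternary set."""
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1, 1)

# Toy example: small weights collapse to exactly 0 and can be skipped,
# and the surviving +/-1 entries need only additions/subtractions.
w = np.random.randn(4, 4) * 0.1
print(ternarize_weights(w))
```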
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to make more of the model's activations exactly 0.
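I don't remember which exact variant they used, but the general idea can be illustrated with a thresholded ReLU (the threshold values and names below are just for illustration):

```python
import numpy as np

def thresholded_relu(x: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """A 'modified ReLU': zero out anything below a positive threshold,
    so more activations become exactly 0 and can be skipped at inference."""
    return np.where(x > threshold, x, 0.0)

x = np.random.randn(1024)
for t in (0.0, 0.5, 1.0):
    sparsity = np.mean(thresholded_relu(x, t) == 0.0)
    print(f"threshold={t}: {sparsity:.0%} of activations are zero")
```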
“Useful” does not mean “better”. It just means “we could not do dense”. All modern state-of-the-art models use dense layers (both weights and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.
Based on all the examples I’ve seen so far in this thread, it’s clear there’s no evidence that sparse models actually work better than dense models.