What do you mean by "work better" here? If it's better accuracy, then no, they are not better at the same weight dimensions.
The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. That more dimensions lead to better results does not seem to be under much contention; the open questions are more about quantifying that. It simply hasn't been shown experimentally because the hardware to train such models isn't there yet.
The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude.
Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?
Sparse matrices can increase in dimension while keeping the same number of non-zeroes; that part is self-evident. Sparse-weight models can be trained: you are probably already aware of RigL and SRigL, and there is other related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs, and that none of them train at 100x or 1000x the dimensions. Yes, that is the part that requires access to sparse compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).
[1] https://dl.acm.org/doi/10.1109/MM.2023.3295848
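A minimal sketch of the first point, that a sparse matrix can grow in dimension while the number of non-zeroes (and hence the matvec cost) stays fixed. The sizes are made up and this is nothing like RigL itself, just the bookkeeping argument in code:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
nnz = 1_000_000  # hold the number of non-zero weights fixed

for dim in (1_000, 10_000, 100_000):
    # Random sparsity pattern for a dim x dim weight matrix with ~nnz non-zeroes.
    rows = rng.integers(0, dim, nnz)
    cols = rng.integers(0, dim, nnz)
    vals = rng.standard_normal(nnz)
    W = sparse.coo_matrix((vals, (rows, cols)), shape=(dim, dim)).tocsr()

    x = rng.standard_normal(dim)
    y = W @ x  # work scales with the non-zero count, not with dim * dim
    print(f"dim={dim:>7}  stored non-zeroes={W.nnz:>9}  output shape={y.shape}")
```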
All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.
Examples:

- GPT OSS 120b
- Kimi K2
- DeepSeek R1
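A toy illustration of that "small fraction of parameters per token" point, with made-up sizes and a plain top-k router in NumPy (real MoE layers add load balancing, shared experts, and so on):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2                              # hypothetical sizes
experts = rng.standard_normal((n_experts, d_model, d_model))  # one weight matrix per expert
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    logits = x @ router                            # routing scores for this token
    top = np.argsort(logits)[-k:]                  # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the chosen experts only
    # Only k of the n_experts weight matrices are read and multiplied for this token.
    return sum(g * (experts[e] @ x) for g, e in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)                    # (64,), computed with 2 of 8 experts
```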
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specify which kind of sparsity.
For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
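A rough sketch of the "-1, 0, or 1" idea (not the actual BitNet kernel; the weight values below are just random): the matvec turns into additions and subtractions, and zero weights can be skipped outright.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 16, 32
# Hypothetical ternary weights, mostly zero, in the spirit of BitNet 1.58.
W = rng.choice([-1, 0, 1], size=(d_out, d_in), p=[0.2, 0.6, 0.2])
x = rng.standard_normal(d_in)

def ternary_matvec(W, x):
    y = np.zeros(W.shape[0])
    for i, row in enumerate(W):
        # No multiplications: add inputs hit by +1, subtract inputs hit by -1,
        # and never touch the positions where the weight is 0.
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y

assert np.allclose(ternary_matvec(W, x), W @ x)  # matches the ordinary matvec
```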
Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
https://transformer-circuits.pub/2022/toy_model/index.html
https://arxiv.org/abs/1803.03635
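To make the block-matrix picture above concrete, here's a tiny NumPy check (made-up sizes, gating weights left out): routing a token through only the active experts gives the same result as multiplying the concatenated block matrix by an input vector that is zero outside the active blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
W_big = np.concatenate(experts, axis=1)     # shape (d, n_experts * d): one big block matrix

x = rng.standard_normal(d)
active = [0, 2]                             # pretend the router picked experts 0 and 2

# Mixture-of-experts view: run only the active experts and sum their outputs.
moe_out = sum(experts[e] @ x for e in active)

# Block-matrix view: a long input that is non-zero only in the active experts' blocks.
x_big = np.zeros(n_experts * d)
for e in active:
    x_big[e * d:(e + 1) * d] = x

assert np.allclose(moe_out, W_big @ x_big)
```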
EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:
To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).
Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).
### 1. The Representation: Hyperdimensional Computing (HDC)
Current Transformers use "dense" embeddings, e.g. a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters. To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.

* **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
* **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations:
  * **Binding (Association):** XOR operations (`A ⊕ B`).
  * **Bundling (Superposition):** Majority rule (voting).
  * **Permutation:** Bit shifting.
* **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
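A minimal sketch of those three bitwise operations on random binary hypervectors (the dimensions and the role/filler example are made up; this is just the textbook HDC recipe, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                  # hypervector length

def hv():                                   # fresh random binary hypervector
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):                             # binding / association: XOR
    return a ^ b

def bundle(*vs):                            # bundling / superposition: bitwise majority vote
    return (2 * np.sum(vs, axis=0) > len(vs)).astype(np.uint8)

def permute(a, n=1):                        # permutation: circular bit shift
    return np.roll(a, n)

def similarity(a, b):                       # fraction of matching bits; ~0.5 means unrelated
    return float(np.mean(a == b))

subj, verb, obj = hv(), hv(), hv()          # role vectors
cat, chases, mouse = hv(), hv(), hv()       # filler vectors
memory = bundle(bind(subj, cat), bind(verb, chases), bind(obj, mouse))

print(similarity(bind(memory, subj), cat))  # ~0.75: the binding is recoverable, just noisy
print(similarity(bind(memory, subj), hv())) # ~0.5: baseline for an unrelated vector
```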
### 2. The Architecture: "Spiking" Attention Mechanisms
Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.

* **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
* **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function.
  * *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
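A rough sketch of swapping Softmax for a k-Winner-Take-All step in attention (made-up sizes; a real spiking transformer would also discretize the values into spikes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, k = 512, 64, 8                      # keep only the top-k connections per query

Q = rng.standard_normal((n_tokens, d))
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))
scores = Q @ K.T / np.sqrt(d)                    # (n_tokens, n_tokens) relevance scores

def softmax_attention(scores, V):
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # every connection keeps a non-zero weight
    return w @ V

def kwta_attention(scores, V, k):
    out = np.zeros_like(V)
    for i, row in enumerate(scores):
        winners = np.argpartition(row, -k)[-k:]  # only the k strongest connections "fire"
        w = np.exp(row[winners] - row[winners].max())
        w /= w.sum()
        out[i] = w @ V[winners]                  # the other n_tokens - k connections do no work
    return out

dense = softmax_attention(scores, V)
spiky = kwta_attention(scores, V, k)
print(dense.shape, spiky.shape, f"{k}/{n_tokens} connections used per query")
```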
### 3. The Hardware: Neuromorphic Substrate
Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).

* **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
* **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
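A toy, purely software sketch of the AER idea (nothing here models Loihi or NorthPole; the fan-out table and threshold are invented): work only happens when a spike packet arrives at an address, and silent neurons cost nothing.

```python
from collections import defaultdict

# Hypothetical sparse fan-out: source neuron -> list of (target neuron, synaptic weight).
fanout = {
    0: [(3, 0.6), (7, 0.9)],
    1: [(3, 0.5)],
    2: [(9, 1.2)],
}
THRESHOLD = 1.0
potential = defaultdict(float)               # membrane potential per target neuron

def deliver(spike_events):
    """Process a stream of (timestamp, source_address) spike packets."""
    emitted = []
    for t, src in spike_events:
        for tgt, w in fanout.get(src, []):   # only the addressed targets do any work
            potential[tgt] += w
            if potential[tgt] >= THRESHOLD:  # integrate-and-fire
                emitted.append((t, tgt))
                potential[tgt] = 0.0
    return emitted

# Everything not named in these packets stays idle the whole time.
print(deliver([(0, 0), (1, 1), (2, 0), (3, 2)]))  # -> [(1, 3), (2, 7), (3, 9)]
```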
### Summary: The Hypothetical "Spiking HD-Transformer"
| Feature | Standard Transformer | Simulated "Brain-Like" Transformer |
| :--- | :--- | :--- |
| *Dimension* | Low (~4k), Dense, Float32 | *Ultra-High* (~100k), Sparse, Binary |
| *Operation* | Matrix Multiplication (MACs) | *Bitwise XOR / Popcount* |
| *Attention* | Global Softmax ($O(N^2)$) | *Spiking k-Winner-Take-All* (Linear) |
| *Activation* | Continuous (ReLU/GELU) | *Discrete Spikes* (Fire-or-Silence) |
| *Hardware* | GPU (Synchronous) | *Neuromorphic* (Asynchronous) |
I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.
Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well-known method for compressing models: making them smaller and faster, not better.
Before we had proper GPUs everyone said the same thing about Neural Networks.
Current model architectures are optimized to get the most out of GPUs, which is why transformers dominate: they're mostly large, dense matrix multiplies.
There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
The lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so, because on GPUs you still end up doing dense multiplies.
Plenty of mech interp work shows that models are forced to commingle different concepts to fit them into the "low" dimensional vector space. (https://www.neelnanda.io/mechanistic-interpretability/glossa...)
https://arxiv.org/abs/2210.06313
https://arxiv.org/abs/2305.01610
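As a simplified illustration of that last point (this is plain one-shot magnitude pruning, not the full iterative LTH procedure): zeroing 90% of a weight matrix doesn't make the GPU's dense matvec any cheaper unless the hardware or kernel actually exploits the sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096))
x = rng.standard_normal(4096)

# One-shot magnitude pruning: keep only the largest 10% of weights by absolute value.
threshold = np.quantile(np.abs(W), 0.9)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)
print(f"non-zero weights after pruning: {np.count_nonzero(W_pruned) / W.size:.1%}")

# As far as a dense GEMV kernel is concerned, both of these cost the same:
y_dense = W @ x
y_pruned = W_pruned @ x   # 90% of the multiply-adds involve zeros, but they still execute
```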