Comment by p1esk

6 days ago

Deep Learning models would work way better with much higher dimensional sparse vectors

Citations?

There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by searching for sparse training or lottery ticket hypothesis papers.

The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights let you train the sparsity pattern alongside the weights, that the effective part of a dense model is actually sparse (cf. pruning/sparsification research), and that dense models grow too much in compute complexity to allow further increases in model dimension.
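
To make the pruning point concrete, here is a minimal sketch of one-shot magnitude pruning in plain Python. It is only an illustration: `magnitude_prune` and the toy 20x20 weight matrix are hypothetical names and sizes, not taken from any of the papers mentioned, and real sparse-training methods typically update the mask during training rather than pruning once.

```python
import random

def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to drop (0.9 keeps roughly 10%).
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_drop = int(len(magnitudes) * sparsity)
    if n_drop == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_drop - 1]  # largest magnitude that gets dropped
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

# Toy dense layer (hypothetical size), pruned to 90% sparsity.
rng = random.Random(0)
dense = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(20)]
sparse = magnitude_prune(dense, sparsity=0.9)
kept = sum(1 for row in sparse for w in row if w != 0.0)
print(f"kept {kept} of 400 weights")
```

Whether the pruned model still predicts well is an empirical question that this sketch does not answer; it only shows what "the effective part is sparse" means mechanically.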

  • If you can share that bibliography, I'd love to read it. I have the same intuition, and a few papers seem to support it, but more explicit ones would be much better.

  • I could not find any evidence that sparse models work better than dense models.

    • What do you mean by "work better" here? If you mean better accuracy, then no, they are not better at the same weight dimensions.

      The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. That more dimensions lead to better results does not seem to be under much contention; the open questions are more about quantifying the gain. It simply has not been shown experimentally because the hardware to train such models is not there.

    • All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.

      Examples:
      - GPT OSS 120b
      - Kimi K2
      - DeepSeek R1
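
For readers unfamiliar with the mechanism, here is a rough sketch of top-k expert routing in plain Python. It is a toy under stated assumptions, not the router of any model listed above: the eight scaling "experts", the hard-coded `logits`, and the names `top_k_route` and `moe_forward` are all hypothetical.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax over just those k."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for index, weight in routes:
        y = experts[index](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [index for index, _ in routes]

# Toy setup: 8 "experts" that just scale the input by 1..8.
experts = [lambda x, s=float(i + 1): [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # hypothetical router output
output, used = moe_forward([1.0, 2.0], experts, logits, k=2)
print(f"experts used: {used} of {len(experts)}")
```

The sparsity here is in which parameters are active per token; the weights inside each expert are still dense.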

    • https://transformer-circuits.pub/2022/toy_model/index.html

      https://arxiv.org/abs/1803.03635

      EDIT: I don't have time to write it up, but here's Gemini 3 with a short explanation:

      To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).

      Here is how we could simulate a "Brain-Like Transformer" by combining Hyperdimensional Computing (HDC) with Spiking Neural Networks (SNNs).

      ### 1. The Representation: Hyperdimensional Computing (HDC)

      Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters. To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.

        * **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
        * **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations:
            * **Binding (Association):** XOR operations (`A ⊕ B`).
            * **Bundling (Superposition):** Majority rule (voting).
            * **Permutation:** Cyclic bit shifting (rotation).
        * **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
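
The three operations listed above can be sketched in a few lines of plain Python, using arbitrary-precision integers as bit vectors. This is a minimal illustration under assumptions: it uses dense random binary vectors (about half the bits set), which is the setting where XOR binding is usually described, rather than the sparse vectors mentioned earlier, and the function names and the `cat`/`dog`/`bird` items are hypothetical.

```python
import random

D = 10_000  # dimensionality, matching the "10,000+" figure above

def random_hv(rng):
    """Random binary hypervector, stored as a D-bit Python int."""
    return rng.getrandbits(D)

def bind(a, b):
    """Binding via XOR; self-inverse, so bind(bind(a, b), b) == a."""
    return a ^ b

def permute(a, shift=1):
    """Permutation as a cyclic left rotation by `shift` bits."""
    shift %= D
    mask = (1 << D) - 1
    return ((a << shift) | (a >> (D - shift))) & mask

def bundle(vectors):
    """Bundling via bitwise majority vote (use an odd count to avoid ties)."""
    result = 0
    half = len(vectors) / 2
    for bit in range(D):
        ones = sum((v >> bit) & 1 for v in vectors)
        if ones > half:
            result |= 1 << bit
    return result

def hamming_similarity(a, b):
    """Fraction of matching bits: near 0.5 for unrelated vectors, 1.0 if equal."""
    return 1.0 - bin(a ^ b).count("1") / D

rng = random.Random(0)
cat, dog, bird, unrelated = (random_hv(rng) for _ in range(4))
animals = bundle([cat, dog, bird])
print("bundle vs member:   ", hamming_similarity(animals, cat))
print("bundle vs unrelated:", hamming_similarity(animals, unrelated))
```

A bundle of three random vectors should agree with each member on roughly three quarters of its bits, while an unrelated vector should sit near one half, which is what makes membership queries possible with only XOR and popcount.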
      

      ### 2. The Architecture: "Spiking" Attention Mechanisms

      Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.

        * **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
        * **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function.
            * *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
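
As a rough sketch of the k-Winner-Take-All idea, the plain-Python snippet below keeps only the k highest-scoring keys per query and renormalizes over them. The renormalization step (a softmax over the winners) and the names `k_winner_take_all` and `sparse_attention` are assumptions made for illustration. Note that this sketch still computes every score before picking winners, so it shows the sparsity of the resulting attention weights, not a reduction in the cost of scoring.

```python
import math

def k_winner_take_all(scores, k):
    """Keep the k largest scores and renormalize them; losers get no weight.

    Returns {index: weight} for the winners only.
    """
    winners = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in winners)
    exps = {i: math.exp(scores[i] - top) for i in winners}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query attends to only k keys."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        weights = k_winner_take_all(scores, k)
        out = [0.0] * len(values[0])
        for j, w in weights.items():  # only k value vectors are touched
            out = [o + w * v for o, v in zip(out, values[j])]
        outputs.append(out)
    return outputs
```

For example, with `k=1` each query simply copies the value of its best-matching key.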
      

      ### 3. The Hardware: Neuromorphic Substrate

      Even if you write sparse code, a standard GPU (e.g., NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).

        * **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
        * **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
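
The event-driven idea can be imitated in software with a priority queue of spike events. The sketch below is a toy simulation, not how Loihi or NorthPole are programmed: it assumes simple integrate-and-fire neurons without leak, a fixed one-step delay, and a hypothetical `synapses` dictionary. Its only purpose is to show that work is done per delivered spike rather than per neuron per clock tick.

```python
import heapq

def run_event_driven(initial_spikes, synapses, threshold=1.0, delay=1,
                     max_events=10_000):
    """Process spikes in time order; neurons are only touched when hit.

    initial_spikes: list of (time, neuron_id)
    synapses: dict neuron_id -> list of (target_id, weight)
    Returns (fired, deliveries): the spike log and the number of updates done.
    """
    potential = {}
    queue = list(initial_spikes)
    heapq.heapify(queue)
    fired, deliveries = [], 0
    while queue and len(fired) < max_events:  # guard against cyclic networks
        t, src = heapq.heappop(queue)
        fired.append((t, src))
        for target, weight in synapses.get(src, []):
            deliveries += 1
            potential[target] = potential.get(target, 0.0) + weight
            if potential[target] >= threshold:
                potential[target] = 0.0  # reset after firing
                heapq.heappush(queue, (t + delay, target))
    return fired, deliveries

# Toy network: neuron 0 excites 1 (weakly) and 2 (strongly); 2 excites 3.
synapses = {0: [(1, 0.6), (2, 1.0)], 2: [(3, 1.0)]}
fired, deliveries = run_event_driven([(0, 0)], synapses)
print(fired, deliveries)
```

Neuron 1 receives a sub-threshold input and never fires, so nothing downstream of it is ever computed.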
      

      ### Summary: The Hypothetical "Spiking HD-Transformer"

      | Feature | Standard Transformer | Simulated "Brain-Like" Transformer |
      | :--- | :--- | :--- |
      | *Dimension* | Low (~4k), Dense, Float32 | *Ultra-High* (~100k), Sparse, Binary |
      | *Operation* | Matrix Multiplication (MACs) | *Bitwise XOR / Popcount* |
      | *Attention* | Global Softmax ($N^2$) | *Spiking k-Winner-Take-All* (Linear) |
      | *Activation* | Continuous (ReLU/GELU) | *Discrete Spikes* (Fire-or-Silence) |
      | *Hardware* | GPU (Synchronous) | *Neuromorphic* (Asynchronous) |
