Comment by solarkraft

3 days ago

I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?

It's an incremental improvement, not really a revolutionary step.

That being said, I think one could adapt an existing model to add mHC by initializing the routing matrix to the regular residual connection and then post-training the hyper-connection matrices. This would let you continue training existing models more efficiently.
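
For concreteness, here's a rough PyTorch sketch of that initialization. The class, tensor layout, and read/write convention are my own assumptions for illustration, not the paper's actual code:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Hypothetical mHC-style routing layer (my sketch, not the paper's code),
    initialized so that at step 0 it behaves like a plain residual connection."""

    def __init__(self, n_streams: int):
        super().__init__()
        # Identity mixing: stream i initially routes only to stream i.
        self.mix = nn.Parameter(torch.eye(n_streams))
        # Unit write weights: every stream initially receives the full block
        # output, exactly like the usual `x + block(x)` residual update.
        self.write = nn.Parameter(torch.ones(n_streams))

    def read(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d_model) -> what the block sees
        return streams.mean(dim=0)

    def forward(self, streams: torch.Tensor, block_out: torch.Tensor) -> torch.Tensor:
        # Mix the streams, then write the block output into each of them.
        mixed = torch.einsum("ij,jbsd->ibsd", self.mix, streams)
        return mixed + self.write[:, None, None, None] * block_out
```

If the streams start as copies of the residual x, this is a no-op relative to the baseline at initialization (read gives x, each stream becomes x + block_out), so post-training only has to learn deviations from identity routing.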

  • That initialization strategy (effectively starting as identity to match the standard residual stream) is clever. It would let you perform surgery on an existing model like Llama-3 and fine-tune it into an mHC architecture.

    The main risk I see is that the 7x signal amplification can kick in very aggressively. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on those new routing matrices to prevent them from blowing up the pre-trained features in the first few steps (rough sketch at the end of this comment).

    Also, I think there's a mix-up here between mHC (this paper, expressivity) and MLA (latent attention, which provides the massive context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.
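
    Concretely, I'd do something like separate optimizer parameter groups plus harder clipping on the new matrices. A hypothetical sketch; `routing_params`, `base_params`, `model`, and `loader` are placeholder names, not anything from the paper:

    ```python
    import torch

    # Placeholders: `routing_params` are the newly added mHC matrices,
    # `base_params` are the pre-trained backbone weights.
    optimizer = torch.optim.AdamW([
        {"params": base_params, "lr": 1e-5},
        {"params": routing_params, "lr": 1e-6},  # much smaller LR for the new matrices
    ])

    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        # Clip the new routing matrices far harder than the backbone so they
        # can't blow up the pre-trained features in the first few steps.
        torch.nn.utils.clip_grad_norm_(routing_params, max_norm=0.1)
        torch.nn.utils.clip_grad_norm_(base_params, max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
    ```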

I think the biggest benefit is bandwidth more so than efficiency. This gives you multiple streams to mux and a means to control their mixing.

The biggest innovation, I think, may have been accidental: the doubly stochastic matrix implements conservation on the signal stream.

Treating the signal like the information it is, as we do in any other domain, is crucial for maintaining its coherence. We don't allow a network router to generate more packets than it receives for the same reason.
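
To make the conservation point concrete, here's a small sketch (my own illustration; I don't know whether the paper parameterizes the matrix this way) using Sinkhorn normalization to produce a doubly stochastic mixing matrix, then checking that the total signal across streams is preserved:

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 50) -> torch.Tensor:
    """Alternate row/column normalization (Sinkhorn-Knopp) to project a
    positive matrix onto an (approximately) doubly stochastic one."""
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

n = 4
mix = sinkhorn(torch.randn(n, n))
streams = torch.randn(n, 8)  # n residual streams, toy feature dim 8
mixed = mix @ streams

# Columns summing to 1 mean the router only redistributes signal between
# streams; it never creates or destroys total mass -- like a network router
# that can't emit more packets than it receives.
print(torch.allclose(streams.sum(dim=0), mixed.sum(dim=0), atol=1e-5))  # True
```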

https://arxiv.org/abs/2512.24880 was published less than two weeks ago, which should explain why it's not more common yet. And it's not that amazing either. It's a slight quality improvement for a slight increase in cost. It's not even clear to me whether it pays for itself.