Comment by taykolasinski

3 days ago

That initialization strategy (effectively starting as the identity so the block reproduces the standard residual stream) is clever. It would let you perform surgery on an existing model like Llama-3 and fine-tune it into an mHC architecture. Something like the sketch below.
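
Roughly what I have in mind, as a PyTorch sketch. To be clear, this is my own guess at the shape of it, not the paper's parameterization; `HCRouting`, `n_streams`, and the read/write/mix names are all made up here:

```python
import torch
import torch.nn as nn

class HCRouting(nn.Module):
    """Hypothetical hyper-connection-style routing wrapped around a
    pre-trained block, initialized so the wrapper is a no-op relative
    to the standard residual stream."""
    def __init__(self, block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.block = block
        # Read weights: which mix of the n residual streams feeds the block.
        read = torch.zeros(n_streams); read[0] = 1.0    # identity: stream 0 only
        self.read = nn.Parameter(read)
        # Write weights: how the block output is scattered back to the streams.
        write = torch.zeros(n_streams); write[0] = 1.0  # identity: stream 0 only
        self.write = nn.Parameter(write)
        # Stream-mixing matrix, started at the identity.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d_model)
        h = torch.einsum("s,sbtd->btd", self.read, streams)          # read
        out = self.block(h)                                          # original layer
        streams = torch.einsum("sr,rbtd->sbtd", self.mix, streams)   # mix
        return streams + self.write.view(-1, 1, 1, 1) * out          # write
```

At step 0 this computes exactly `x + block(x)` on stream 0 and passes the other streams through untouched, so the loaded Llama weights see no change until the routing parameters start to move.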

The main risk I see is how aggressively that 7x signal amplification can kick in. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on the new routing matrices to keep them from blowing up the pre-trained features in the first few steps. Concretely, something like this.
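
A guess at the recipe, assuming the routing params are identifiable by name (the `read`/`write`/`mix` names carry over from my sketch above):

```python
import torch
from torch import nn

def build_optimizer(model: nn.Module, routing_keys=("read", "write", "mix")):
    """Split params into backbone vs. new routing weights and give the
    routing group a much smaller learning rate."""
    routing, backbone = [], []
    for name, p in model.named_parameters():
        (routing if any(k in name for k in routing_keys) else backbone).append(p)
    opt = torch.optim.AdamW([
        {"params": backbone, "lr": 1e-5},
        {"params": routing, "lr": 1e-6},  # ~10x smaller on the routing matrices
    ])
    return opt, routing, backbone

# In each training step, clip the routing grads much harder than the rest
# so the identity init drifts slowly instead of blowing up early:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(routing, max_norm=0.1)
#   torch.nn.utils.clip_grad_norm_(backbone, max_norm=1.0)
#   opt.step(); opt.zero_grad()
```

The exact LR ratio and clip norms are pulled out of thin air; the point is just to slow the routing weights down relative to the backbone.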

Also, I think there's a mix-up here between mHC (this paper, which is about expressivity) and MLA (Multi-head Latent Attention, which is what provides the massive KV-cache/context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.