Comment by taykolasinski

3 days ago

OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale); there's a toy sketch of the mechanism below.
2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

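To make the first takeaway concrete, here is a toy sketch of the failure mode, not my actual training code: if the skip path carries an unconstrained learnable gain and the learned gains drift even slightly above 1, the residual-stream norm compounds geometrically with depth. The depth and gain values below are hypothetical, chosen only to show how a ~7x blow-up can arise.

```python
import torch

# Toy illustration (hypothetical numbers, not measurements from my run):
# an unconstrained learnable gain on the skip path compounds across layers.
depth, width = 24, 256
x = torch.randn(1, width)
initial_norm = x.norm().item()

gains = torch.full((depth,), 1.085)    # hypothetical learned residual gains, slightly > 1
for g in gains:
    block_out = torch.zeros_like(x)    # ignore F(x) here to isolate the skip path
    x = g * x + block_out              # unconstrained weighted residual update
print(x.norm().item() / initial_norm)  # ~7x, since 1.085 ** 24 ≈ 7.1
```
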
This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.

How do you know "GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: x+F(x)."?

  • I’m referring specifically to the fundamental residual connection backbone that defines the transformer architecture (x_{l+1} = x_l + F(x_l)).

    While the sub-modules differ (MHA vs GQA, SwiGLU vs GeLU, Mixture-of-Depths, etc.), the core signal propagation in Llama, Gemini, and Claude relies on that additive residual stream.

    My point here is that DeepSeek's mHC challenges that fundamental additive assumption by introducing learnable weighted scaling factors to the residual path itself.
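    To make the contrast concrete, here is a minimal PyTorch sketch of the two update rules. The class names, the MLP standing in for F, and the scalar alpha/beta weights are my own illustration of the "learnable weighted scaling" idea, not the paper's exact mHC formulation.

```python
import torch
import torch.nn as nn

class PlainResidualBlock(nn.Module):
    """Standard transformer-style update: x_{l+1} = x_l + F(x_l)."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)

class WeightedResidualBlock(nn.Module):
    """Illustrative weighted-residual variant: the identity path and the
    block output each get a learnable scale instead of a fixed 1."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        self.alpha = nn.Parameter(torch.ones(()))  # weight on the skip path x
        self.beta = nn.Parameter(torch.ones(()))   # weight on the block output F(x)

    def forward(self, x):
        return self.alpha * x + self.beta * self.f(x)
```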

    • I guess I am asking how we know Gemini and Claude rely on the additive residual stream. We don't know the architecture details for these closed models, do we?
