Comment by anthonypasq

1 day ago

> it is well optimized for fast inference

Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms, sparsity, or something else?


adrian_b 13 hours ago

The model is expected to be published today on Huggingface.co, where there should be more information.

For now, this is what NVIDIA says:

  Nemotron 3 Ultra is NVIDIA's largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture.

  Similar to Nemotron 3 Super, it was pre-trained using NVFP4 and shares the same core technical innovations:

    LatentMoE — Compresses tokens into a low-rank latent space before routing, enabling 4× as many expert specialists for the same inference cost.

    Multi-Token Prediction (MTP) — Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding at inference time.

    1M Token Context Length — Mamba-2 layers provide linear-time complexity over sequence length, making 1M-token context practical for long-document and agentic workloads.
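
To make two of those points concrete, here is a toy sketch of my own (not NVIDIA's code, and not the actual Nemotron layers). It assumes a plain top-k router and a scalar linear recurrence with fixed coefficients; real Mamba-2 layers use input-dependent parameters and a much larger state. All function names are made up for illustration.

```python
import math

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts (toy router)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_layer(x, experts, router_weights, k):
    """Toy sparse MoE: only the k routed experts run for this token."""
    # One router score per expert: dot product of the token with a router row.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = top_k_route(scores, k)
    # Softmax over the chosen experts only, to get mixing weights.
    m = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - m) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(x)
    for e, i in zip(exps, chosen):
        y = experts[i](x)  # the other experts are never evaluated
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out, chosen

def linear_scan(xs, a=0.9, b=0.1):
    """Toy recurrence h_t = a*h_{t-1} + b*x_t: fixed-size state, constant work per token."""
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out

def attention_pair_count(n):
    """Number of (query, key) pairs that causal self-attention scores for n tokens."""
    return n * (n + 1) // 2

# Fraction of parameters active per token, from the figures quoted above.
active_fraction = 55 / 550

if __name__ == "__main__":
    experts = [lambda x, s=s: [s * v for v in x] for s in range(4)]
    router = [[1, 0], [0, 1], [-1, 0], [0, -1]]
    out, chosen = moe_layer([1.0, 2.0], experts, router, k=2)
    print("experts run:", chosen, "of", len(experts))
    print("active fraction:", active_fraction)
    for n in (1_000, 1_000_000):
        print(n, "tokens: attention pairs =", attention_pair_count(n),
              "| recurrence steps =", n)
```

The MoE part shows why total and active parameter counts can differ so much: per token, only the routed experts do any work. The second part shows why the recurrent layers help with long context: the recurrence keeps a fixed-size state and does one step per token, while the number of attention pairs grows quadratically with sequence length.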