← Back to context

Comment by anthonypasq

1 day ago

> it is well optimized for fast inference

do you have any insight into the actual technical details that make this sort of things possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?

The model is expected to be published today on Huggingface.co, where there should be more information.

For now, this is what NVIDIA says:

  Nemotron 3 Ultra is NVIDIA's largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture.

  Similar to Nemotron 3 Super, it was pre-trained using NVFP4 and shares the same core technical innovations:

    LatentMoE — Compresses tokens into a low-rank latent space before routing, enabling 4× as many expert specialists for the same inference cost.

    Multi-Token Prediction (MTP) — Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding at inference time.

    1M Token Context Length — Mamba-2 layers provide linear-time complexity over sequence length, making 1M-token context practical for long-document and agentic workloads.