Comment by adrian_b
16 hours ago
The model is expected to be published today on Hugging Face (huggingface.co), where more information should be available.
For now, this is what NVIDIA says:
Nemotron 3 Ultra is NVIDIA's largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture.
Similar to Nemotron 3 Super, it was pre-trained using NVFP4 and shares the same core technical innovations:
LatentMoE — Compresses tokens into a low-rank latent space before routing, enabling 4× as many expert specialists for the same inference cost.
Multi-Token Prediction (MTP) — Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding at inference time.
1M Token Context Length — Mamba-2 layers provide linear-time complexity over sequence length, making 1M-token context practical for long-document and agentic workloads.
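As a rough sanity check on the quoted figures, here is a back-of-the-envelope sketch in Python. Only the 550B / 55B parameter counts and the 1M-token context come from the announcement. Everything else is my assumption: the bits-per-parameter values are hypothetical storage precisions (the quote only says NVFP4 was used for pre-training, not how the weights will be shipped), the 4-bit case ignores the overhead of block scale factors, the 125K baseline context is arbitrary, and the growth model ignores constant factors and the attention layers that a hybrid Mamba-Transformer still contains.

```python
# Figures taken from the quoted NVIDIA text.
TOTAL_PARAMS = 550e9        # total parameters
ACTIVE_PARAMS = 55e9        # "up to 55B active per token"
TARGET_CONTEXT = 1_000_000  # 1M-token context

# Assumptions for illustration only (not from the announcement).
BASELINE_CONTEXT = 125_000     # arbitrary shorter context to compare against
STORAGE_BITS = (4, 8, 16)      # hypothetical weight precisions


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters used for a single token in the MoE."""
    return active / total


def weight_bytes(params: float, bits_per_param: int) -> float:
    """Raw weight storage, ignoring scale factors and runtime overhead."""
    return params * bits_per_param / 8


def cost_growth(n_from: int, n_to: int, exponent: int) -> float:
    """How much an O(n**exponent) cost grows when context goes n_from -> n_to."""
    return (n_to / n_from) ** exponent


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active per token: {frac:.0%} of all parameters")

    for bits in STORAGE_BITS:
        gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        print(f"weights at {bits:>2} bits/param: about {gb:,.0f} GB")

    linear = cost_growth(BASELINE_CONTEXT, TARGET_CONTEXT, 1)
    quadratic = cost_growth(BASELINE_CONTEXT, TARGET_CONTEXT, 2)
    print(f"linear-time layers: {linear:.0f}x more work per sequence")
    print(f"quadratic attention: {quadratic:.0f}x more work per sequence")
```

Under these assumptions, 10% of the parameters are active per token, the weights alone would take about 275 GB at 4 bits per parameter, and going from 125K to 1M tokens costs 8x for linear-time layers versus 64x for quadratic attention. All parameters still have to be resident in memory even though only a fraction is used per token, so the 55B figure helps compute cost far more than it helps memory footprint.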