Tomorrow NVIDIA will publish Nemotron 3 Ultra, which will be the biggest open weights LLM from a US company (550B parameters).
The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.
While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.
In any case, I think it is good to have another option among the big open weights LLMs. Experience so far shows that even when one model is clearly better on average than another, the weaker model can still win in some particular applications.
With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
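To make the "try several and keep the best" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `pick_best`, the stub models, and the scoring function are placeholders I made up, not any real inference API. In practice each callable would wrap a locally hosted model and `score` would be a task-specific check (unit tests, a rubric, and so on).

```python
from typing import Callable, Dict, Tuple


def pick_best(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    score: Callable[[str], float],
) -> Tuple[str, str]:
    """Run the same prompt through every model and return (name, answer) of the top scorer."""
    # Each value in `models` is assumed to be a function: prompt -> generated text.
    answers = {name: generate(prompt) for name, generate in models.items()}
    best_name = max(answers, key=lambda name: score(answers[name]))
    return best_name, answers[best_name]


# Stub "models" standing in for real local LLM calls (assumption for illustration).
stub_models = {
    "model_a": lambda p: "short",
    "model_b": lambda p: "a longer, more detailed answer",
}

# Toy scoring rule: prefer the longer answer. A real task needs a real metric.
name, answer = pick_best("Summarize this log file", stub_models, score=len)
print(name, "->", answer)
```

The point is only that with local weights the extra runs cost compute rather than per-token API fees, so the comparison loop is cheap to set up.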
NVIDIA seems to be following a smart Intel-like strategy of selling chips and also creating software that helps create demand for those chips. With Intel it was things like MKL, IPP, OpenCV, etc., and with NVIDIA it is not just CUDA and development libraries but also models like Nemotron.
The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run local AI, maybe from NVIDIA, running on local NVIDIA chips.
> it is well optimized for fast inference
Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?
The model is expected to be published today on Huggingface.co, where there should be more information.
For now, this is what NVIDIA says:
Nemotron 3 Ultra is NVIDIA's largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture.
Similar to Nemotron 3 Super, it was pre-trained using NVFP4 and shares the same core technical innovations:
LatentMoE — Compresses tokens into a low-rank latent space before routing, enabling 4× as many expert specialists for the same inference cost.
Multi-Token Prediction (MTP) — Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding at inference time.
1M Token Context Length — Mamba-2 layers provide linear-time complexity over sequence length, making 1M-token context practical for long-document and agentic workloads.
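To connect that description to the "sparsity" question, here is a toy sketch of generic top-k expert routing plus the back-of-envelope numbers. This is not NVIDIA's implementation and it does not model LatentMoE or MTP; the function names, the scalar "experts", and the router scores are all made up for illustration. Only the 550B/55B figures and the 1M context length come from the quoted text.

```python
import math


def top_k_route(scores, k):
    """Return the indices of the k experts with the highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_forward(x, experts, scores, k):
    """Run only the k selected experts and mix their outputs with softmax weights."""
    chosen = top_k_route(scores, k)
    top = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - top) for i in chosen]  # numerically stable softmax
    total = sum(weights)
    output = sum((w / total) * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


def active_fraction(total_params_b, active_params_b):
    """Share of the parameters that actually participates in one token's forward pass."""
    return active_params_b / total_params_b


# Four toy "experts" (plain scalar functions); only two of them run per token.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
out, used = moe_forward(1.0, experts, scores=[0.1, 2.0, -1.0, 1.5], k=2)
print("experts used:", used)

# Figures from the quote: 550B total parameters, up to 55B active per token.
print("active fraction:", active_fraction(550, 55))

# Rough scaling only, ignoring constants: attention compares all token pairs (n * n),
# while a linear-time layer such as Mamba-2 does work proportional to n.
n = 1_000_000
print("pairwise / linear work ratio at 1M tokens:", (n * n) // n)
```

So, per token, roughly a tenth of the weights do any work, and the Mamba-2 layers avoid the quadratic growth that pure attention would hit at 1M tokens. Since the model is a hybrid, whatever attention layers it has would still pay the quadratic cost.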
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product".