Key points to remember:
- Nvidia has released Nemotron 3 Super, a 120B-parameter open MoE model that activates only 12.7B parameters per forward pass.
- Nemotron 3 Super delivers up to 7.5x higher throughput than Qwen3.5-122B-A10B in agent workloads at 8,000-token-input/64,000-token-output settings.
- The model is fully open sourced under the Nvidia Nemotron Open Model license, with checkpoints and training data on Hugging Face.
Nvidia Launches Nemotron 3 Super With 7.5x Throughput Gains Over Qwen3.5-122B
The latest Nvidia model activates only 12.7 billion parameters per forward pass through a Mixture-of-Experts (MoE) architecture, meaning most of its weights stay idle during inference. This design choice directly targets two issues developers face when deploying multi-stage AI agents: the added cost of extended reasoning chains and the rising token usage, which can multiply up to 15 times in multi-agent pipelines.
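The economics of sparse activation come down to routing: a gate scores every expert for each token but only runs the top-k, so compute scales with the number of active experts rather than the total parameter count. A minimal NumPy sketch of this mechanism (toy sizes and a random router, not Nemotron's actual dimensions or implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 16, 2          # toy sizes, not Nemotron's

experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ router_w                  # score every expert (cheap)
    top = np.argsort(logits)[-k:]          # pick the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # softmax over selected experts only
    # only k expert matmuls actually run, regardless of n_experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top)), top

x = rng.standard_normal(d_model)
y, used = moe_forward(x)
print(f"experts used: {len(used)} of {n_experts}")  # prints "experts used: 2 of 16"
```

The same pattern scaled to Nemotron's reported numbers (22 of 512 experts) is what keeps the active parameter count at 12.7B out of 120B.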
Nemotron 3 Super is the second model in Nvidia's Nemotron 3 family, after Nemotron 3 Nano from December 2025. Nvidia announced the release around March 10, 2026.
The model uses a hybrid Mamba-Transformer backbone with 88 layers. Mamba-2 blocks handle long sequences with linear-time efficiency, while Transformer attention layers preserve precise recall. This combination gives the model native support for context windows of up to a million tokens without the memory penalties typical of pure-attention designs.
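The memory argument behind the hybrid design can be made with back-of-envelope arithmetic: an attention layer's KV cache grows linearly with sequence length, while a Mamba-style state-space layer carries a fixed-size state. The dimensions below are assumed toy values for illustration, not Nemotron 3 Super's published ones:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Attention KV cache: grows linearly with sequence length (K and V)."""
    return seq_len * n_layers * 2 * n_kv_heads * head_dim * bytes_per

def ssm_state_bytes(n_layers, d_state=128, d_model=4096, bytes_per=2):
    """Mamba-style recurrent state: constant in sequence length."""
    return n_layers * d_state * d_model * bytes_per

# 88 attention layers vs. a hybrid split (8 attention + 80 Mamba), at 1M tokens
attn_only = kv_cache_bytes(1_000_000, 88)
hybrid = kv_cache_bytes(1_000_000, 8) + ssm_state_bytes(80)
print(f"attention-only cache: {attn_only / 2**30:.1f} GiB")
print(f"hybrid cache + state: {hybrid / 2**30:.1f} GiB")
```

Under these assumptions the hybrid layout needs roughly an order of magnitude less inference-time memory at million-token context, which is the penalty the article alludes to.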
Nvidia has also integrated a LatentMoE routing system that compresses token embeddings into a low-rank space before sending them to 512 experts per layer, activating 22 at a time. The company says this allows roughly four times more experts for the same inference cost as standard MoE approaches, and enables finer task specialization, such as separating Python logic from SQL handling at the expert level.
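The low-rank routing idea can be sketched as follows: embeddings are projected into a small latent space, and both the router and the experts operate there, so each expert is far cheaper than a full-width one — which is where the "more experts for the same cost" claim comes from. All dimensions and routing details below are illustrative assumptions, not Nvidia's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent = 128, 16       # toy sizes; the real ratios are not public here
n_experts, k = 32, 4              # stand-ins for the article's 512 and 22

down = rng.standard_normal((d_model, d_latent)) * 0.1   # compress to latent space
router_w = rng.standard_normal((d_latent, n_experts)) * 0.1
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.1
up = rng.standard_normal((d_latent, d_model)) * 0.1     # expand back to d_model

def latent_moe(x):
    z = x @ down                          # route and compute in d_latent,
    logits = z @ router_w                 # so per-expert cost shrinks by
    top = np.argsort(logits)[-k:]         # roughly (d_model / d_latent)^2
    gates = np.exp(logits[top])
    gates /= gates.sum()
    z_out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return z_out @ up

y = latent_moe(rng.standard_normal(d_model))
```

With a fixed compute budget, shrinking each expert this way frees room for a larger expert pool, and a larger pool is what permits the per-domain specialization the article describes.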

Multi-token prediction layers, using two weight-shared prediction heads, accelerate chain-of-thought generation and enable native speculative decoding. On structured tasks, Nvidia reports up to three times faster generation.
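Speculative decoding with draft heads follows a draft-then-verify loop: cheap heads propose several next tokens at once, the full model checks them in a single pass, and the longest agreeing prefix is accepted. The toy functions below only illustrate that control flow; they are stand-ins, not Nemotron internals:

```python
def full_model_next(seq):
    """Stand-in for the full model's 'correct' next token."""
    return (sum(seq) * 31 + len(seq)) % 100

def draft_heads(seq, n=2):
    """Imperfect draft: usually right, occasionally off by one."""
    out, s = [], list(seq)
    for i in range(n):
        t = full_model_next(s)
        if (len(s) + i) % 7 == 0:
            t = (t + 1) % 100              # inject a draft mistake
        out.append(t)
        s.append(t)
    return out

def speculative_step(seq):
    """One draft-and-verify round; returns the accepted tokens."""
    accepted, s = [], list(seq)
    for t in draft_heads(seq):
        if full_model_next(s) == t:        # verification pass agrees
            accepted.append(t)
            s.append(t)
        else:                              # mismatch: keep the real token, stop
            accepted.append(full_model_next(s))
            break
    return accepted

print(speculative_step([3, 1, 4]))
```

When the draft agrees, several tokens land per verification pass instead of one; the output is identical to plain decoding either way, which is why the speedup comes with no accuracy cost.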
The model was pre-trained on 25 trillion tokens in two phases. The first phase used 20 trillion tokens of broad data; the second used five trillion high-quality tokens optimized for benchmark performance. A final extension phase on 51 billion tokens expanded the native context to one million tokens. Post-training included supervised fine-tuning on approximately seven million samples and reinforcement learning across 21 environments with over 1.2 million rollouts.
In benchmarks, Nemotron 3 Super scored 83.73 on MMLU-Pro, 90.21 on AIME25, and 60.47 on SWE-Bench using OpenHands. On PinchBench, it achieved 85.6%, the highest score among open models in its category. In long-context evaluation, it scored 91.64 on RULER 1M.
Compared to GPT-OSS-120B, Nemotron 3 Super delivers 2.2x higher throughput at 8k input and 64k output tokens. Against Qwen3.5-122B-A10B, the gap reaches 7.5x. Nvidia also reports more than five times higher throughput and up to two times higher accuracy than the previous-generation Nemotron Super.
Nvidia trained the model end-to-end in its NVFP4 four-bit floating-point format, optimized for Blackwell GPUs. On B200 hardware, Nvidia claims inference runs up to four times faster than FP8 on the H100, with no reported loss of accuracy. FP8 and NVFP4 quantized checkpoints retain 99.8% or more of the full-precision accuracy.
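The general mechanics of 4-bit floating-point quantization, which underlie formats like NVFP4 (E2M1 element values plus a per-block scale), can be sketched as follows. The block size and scale handling here are illustrative assumptions, not Nvidia's exact specification:

```python
import numpy as np

# The 8 non-negative values representable in E2M1 (2 exponent bits, 1 mantissa bit),
# mirrored for sign: a 16-point grid for a 4-bit code.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

def quantize_block(block):
    """Quantize one block to FP4 with a shared scale; return dequantized values."""
    scale = np.abs(block).max() / 6.0      # map the block max onto FP4's max (6.0)
    if scale == 0:
        scale = 1.0
    # snap each scaled value to its nearest FP4 grid point
    idx = np.argmin(np.abs(block[:, None] / scale - FP4_GRID), axis=1)
    return FP4_GRID[idx] * scale

rng = np.random.default_rng(2)
w = rng.standard_normal(16)                # one 16-element block of weights
wq = quantize_block(w)
rel_err = np.linalg.norm(w - wq) / np.linalg.norm(w)
print(f"relative quantization error: {rel_err:.3f}")
```

Per-block scaling is what keeps the error small despite only 16 representable values per code: each block's dynamic range is remapped onto the grid independently.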
The model also powers the Nvidia AI-Q search agent, which reached the top spot in the Deepresearch Bench rankings.
Nemotron 3 Super is fully open sourced under the Nvidia Nemotron Open Model license. Checkpoints in BF16, FP8, and NVFP4 formats, as well as pre-training data, post-training samples, and reinforcement learning environments, are available on Hugging Face. Inference is supported through Nvidia NIM, build.nvidia.com, Perplexity, OpenRouter, Together AI, Google Cloud, AWS, Azure, and CoreWeave, with on-premises options through Dell Enterprise Hub and HPE.
Developers can access training recipes, fine-tuning guides, and inference cookbooks through the NeMo platform using vLLM, SGLang, and TensorRT-LLM.


