Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


NVIDIA Nemotron 3 Super: 120B Parameters, 12B Active — the Math Behind the Fastest Open-Source Reasoning Model

NVIDIA dropped a bombshell at GTC on March 10: Nemotron 3 Super, a 120-billion-parameter open model that only activates 12 billion parameters at inference time. The result? Throughput numbers that make competing open models look sluggish — 2.2x faster than GPT-OSS-120B and 7.5x faster than Qwen3.5-122B — while scoring competitively on reasoning, coding, and long-context benchmarks.

For anyone building agentic AI systems, this is the model to watch right now.

Why the Architecture Matters

Most large language models force a brutal trade-off: more parameters mean better quality but slower, more expensive inference. Nemotron 3 Super sidesteps this by combining three architectural ideas that rarely appear together.

Mamba-2 layers handle sequential context efficiently using state-space models instead of attention: they carry a fixed-size recurrent state, so compute grows linearly with context length rather than quadratically, and memory stays flat no matter how long the context gets. Transformer attention layers are sprinkled in selectively where global context awareness matters most. And a novel Latent Mixture-of-Experts (LatentMoE) system routes tokens through a compressed latent space before dispatching them to specialized expert networks.
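To see why the fixed-size state matters at 1M-token contexts, here is a back-of-the-envelope comparison of attention's KV cache (which grows with every token) against a Mamba-style recurrent state (which does not). The layer counts and dimensions are illustrative assumptions, not Nemotron 3 Super's actual configuration:

```python
def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Attention KV cache grows linearly with context length:
    two tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers=40, n_heads=8, head_dim=128, state_dim=128, bytes_per_elem=2):
    """A Mamba-style layer keeps a fixed-size recurrent state,
    independent of how many tokens have been processed."""
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem

for ctx in (128_000, 1_000_000):
    print(f"{ctx:>9} tokens: KV cache ~ {kv_cache_bytes(ctx) / 2**30:.1f} GiB, "
          f"SSM state ~ {ssm_state_bytes() / 2**20:.1f} MiB")
```

With these toy numbers the KV cache balloons past 150 GiB at 1M tokens while the state-space memory is unchanged, which is the intuition behind the long-context results later in this post.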

In practice, this means each token only touches 12.7 billion parameters during a forward pass, but the model draws on 120.6 billion parameters of learned capacity. The LatentMoE design projects tokens into a smaller latent dimension for routing, reducing parameter loads and communication overhead by a factor of d/ℓ, where d is the model's hidden dimension and ℓ the smaller latent dimension. Those savings are reinvested into more total experts and more active experts per token — improving accuracy at roughly constant inference cost.
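The routing idea can be sketched in a few lines of NumPy. This is a minimal illustration of latent-space top-k routing, not NVIDIA's implementation; the dimensions, expert count, and top-k value are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)
d, latent, n_experts, top_k = 4096, 512, 64, 4  # illustrative sizes only

# Project the token down into a smaller latent space before routing.
W_down = rng.standard_normal((d, latent)) / np.sqrt(d)
router = rng.standard_normal((latent, n_experts)) / np.sqrt(latent)

def route(x):
    """Score experts in the latent space and pick the top-k."""
    z = x @ W_down                     # (latent,) compressed representation
    logits = z @ router                # (n_experts,) one score per expert
    top = np.argsort(logits)[-top_k:]  # indices of the k highest-scoring experts
    w = np.exp(logits[top])
    return top, w / w.sum()            # softmax weights over the chosen experts

x = rng.standard_normal(d)
experts, weights = route(x)
# Dispatching the 512-dim latent z instead of the 4096-dim x cuts expert
# input size and all-to-all traffic by d/latent = 8x in this sketch.
```

The saved bandwidth is what lets the design afford more experts, and more active experts per token, at the same inference budget.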

NVIDIA also pretrained the model on 25 trillion tokens using NVFP4, their 4-bit floating-point format optimized for Blackwell GPUs. The model learned to be accurate within 4-bit constraints from day one, rather than being quantized after training — a meaningful distinction that shows up in benchmark quality.
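To make the 4-bit constraint concrete, here is a fake-quantization sketch of an FP4 (E2M1) grid with per-block scaling, the general scheme NVFP4 is based on. This is a simplified illustration: real NVFP4 stores the per-block scale in a narrow floating-point format, and quantization-aware pretraining runs the forward pass through values like these while gradients flow in higher precision:

```python
import numpy as np

# The E2M1 grid: the only magnitudes a 4-bit float can represent.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block=16):
    """Fake-quantize: scale each block so its max magnitude maps to 6.0
    (the largest FP4 value), snap every element to the nearest grid
    point, then rescale back."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        scale = np.abs(chunk).max() / 6.0 or 1.0
        mags = np.abs(chunk) / scale
        snapped = FP4_GRID[np.abs(mags[:, None] - FP4_GRID).argmin(axis=1)]
        out[i:i + block] = np.sign(chunk) * snapped * scale
    return out

x = np.random.default_rng(1).standard_normal(64)
xq = quantize_fp4_block(x)
print("max abs rounding error:", np.abs(x - xq).max())
```

Because the model sees these coarse values throughout pretraining, it learns weights that are robust to the rounding — the "accurate within 4-bit constraints from day one" point above.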

Benchmark Numbers: Strong, With Caveats

Nemotron 3 Super posts impressive numbers across several benchmarks:

  • SWE-Bench Verified: 60.47% — solid, though Qwen3.5-122B leads at 66.40%
  • RULER (1M tokens): 91.75% — matching Qwen3.5 and demolishing GPT-OSS-120B’s 22.30%
  • PinchBench: 85.6% across the full test suite, making it the best open model for agent-driven tasks
  • AIME 2025 and Terminal-Bench: Class-leading among open models in its size range

The long-context story is particularly compelling. At 256K tokens, RULER accuracy is 96.30%. At 512K, it’s 95.67%. Even at the full 1M-token context window, it holds 91.75%. GPT-OSS-120B collapses to 22.30% at the same length — a blowout that shows the Mamba architecture’s linear-memory advantage in action.

The caveat: Artificial Analysis puts Nemotron 3 Super's overall Intelligence Index at 36, versus Qwen3.5-122B's score roughly 6 points higher. Qwen pays for that quality edge with about 7.5x lower throughput per GPU. Whether that trade-off is worth it depends entirely on your workload — if you're running multi-agent pipelines where token throughput is the bottleneck, Nemotron 3 Super's speed advantage is massive.

Throughput and Pricing: The Real Competitive Edge

Raw speed is where Nemotron 3 Super pulls away from the pack. Independent testing from Artificial Analysis recorded 478 output tokens per second — faster than any previous open model in this class. On serverless APIs, Lightning AI hits 484 tokens/second and DeepInfra reaches 470.9 tokens/second, both with sub-second time-to-first-token.

Pricing is aggressive. DeepInfra offers the model at $0.10 per million input tokens and $0.50 per million output tokens. OpenRouter lists a free tier. Even premium providers like Lightning AI charge $0.30/$0.80 per million input/output tokens — a fraction of what proprietary models cost for comparable quality.

For agentic workloads where a single task might involve dozens of LLM calls, long context windows, and thousands of output tokens, the combination of high throughput and low cost makes Nemotron 3 Super significantly cheaper to operate than alternatives at similar quality levels.
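A quick back-of-the-envelope estimator makes the point. The prices and the 478 tok/s figure come from this post; the call count and token sizes per task are illustrative assumptions you should replace with your own:

```python
def agent_task_cost(calls=30, in_tokens=20_000, out_tokens=1_500,
                    price_in=0.10, price_out=0.50, tok_per_sec=478):
    """Estimate per-task API cost (USD) and total generation time (s)
    for an agent pipeline making `calls` LLM calls."""
    cost = calls * (in_tokens * price_in + out_tokens * price_out) / 1e6
    gen_seconds = calls * out_tokens / tok_per_sec
    return cost, gen_seconds

cost, secs = agent_task_cost()
print(f"~${cost:.3f} per task, ~{secs / 60:.1f} min of generation time")
```

At DeepInfra's rates a 30-call task works out to roughly eight cents and about a minute and a half of generation; at 7.5x lower throughput the same task's generation time stretches past ten minutes, which is where the speed edge becomes the whole story.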

Nemotron 3 Super vs. the Open-Source Field

The 120B open-source space is getting crowded. Here’s how Nemotron 3 Super stacks up against key competitors:

| Model | Total Params | Active Params | Throughput (relative) | SWE-Bench Verified | RULER 1M | Context Window |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 120.6B | 12.7B | 1x (baseline) | 60.47% | 91.75% | 1M tokens |
| GPT-OSS-120B | 120B | ~120B | 0.45x | ~55% | 22.30% | 128K tokens |
| Qwen3.5-122B | 122B | ~10B | 0.13x | 66.40% | ~91% | 1M tokens |
| DeepSeek V3 | 671B | 37B | varies | — | — | 128K tokens |

Nemotron 3 Super’s sweet spot is clear: it’s not always the smartest model in the room, but it delivers near-top-tier quality at throughput levels that make multi-agent systems actually practical to run. When your pipeline calls the LLM hundreds of times per task, 7.5x faster inference isn’t a nice-to-have — it’s the difference between feasible and not.

Compared to NVIDIA’s own previous Nemotron models (like Llama-3.1-Nemotron-Ultra-253B), Nemotron 3 Super offers over 5x the throughput of its predecessor while maintaining competitive accuracy — a generational leap in efficiency.

Full Open-Source, No Strings

NVIDIA released Nemotron 3 Super with open weights in BF16, FP8, and NVFP4 formats. The training dataset composition and recipes are public. The model integrates with NVIDIA’s NeMo ecosystem — NeMo Gym, NeMo RL, NeMo Data Designer, NeMo Curator, and NeMo Evaluator — providing a complete pipeline from data to deployment.

Availability is broad: build.nvidia.com, Hugging Face, OpenRouter, Perplexity, LM Studio (via Unsloth GGUFs), and cloud platforms including Google Cloud Vertex AI, Oracle Cloud Infrastructure, CoreWeave, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, and Modal.

For local inference, the NVFP4 quantization can run on a single NVIDIA DGX Spark — making it one of the few 120B-class models that’s even remotely approachable for on-premises deployment.

FAQ

How much does it cost to run Nemotron 3 Super via API?
Pricing varies by provider. DeepInfra offers it at $0.10 per million input tokens and $0.50 per million output tokens. OpenRouter has a free tier. Premium providers charge around $0.30/$0.80 per million tokens. Self-hosting on NVIDIA hardware with NVFP4 quantization is also an option.

How does Nemotron 3 Super compare to Qwen3.5-122B?
Qwen3.5-122B scores higher on pure intelligence benchmarks (about 6 points higher on Artificial Analysis’s Intelligence Index) and leads on SWE-Bench Verified (66.40% vs 60.47%). But Nemotron 3 Super delivers 7.5x higher throughput per GPU, matches Qwen on long-context tasks (RULER 1M), and costs less to run. Choose Qwen for maximum single-response quality; choose Nemotron for high-volume agentic workflows.

Can I run Nemotron 3 Super locally?
Yes, with the right hardware. The NVFP4 quantization fits on a DGX Spark. FP8 and BF16 variants require more VRAM. Community GGUF quantizations from Unsloth and LM Studio are available on Hugging Face for various configurations.

What makes this model good for agentic AI?
Three things: the 1M-token native context window gives agents long-term memory across extended tasks; the high throughput (478+ tokens/second) keeps multi-step agent loops responsive; and the LatentMoE architecture maintains quality while keeping per-call costs low enough to handle dozens of LLM calls per task.

Is Nemotron 3 Super truly open source?
NVIDIA released the model weights, training dataset composition, and recipes publicly. It’s available under an open license on Hugging Face. The full training and deployment pipeline integrates with NVIDIA’s open NeMo tools. By current open-source AI standards, this is about as open as it gets for a model of this scale.

