Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Mistral Small 4 Packs 119B Parameters Into 6B Active — and It Does Everything

One model to replace four. That’s the pitch behind Mistral Small 4, released on March 16 during NVIDIA GTC 2026. Where Mistral previously asked developers to pick between Mistral Small for instructions, Magistral for reasoning, Pixtral for vision, and Devstral for coding agents, Small 4 rolls all four capabilities into a single 119B-parameter Mixture-of-Experts architecture — with only 6B parameters active per token. It’s open-source under Apache 2.0, and Mistral claims it matches or beats GPT-OSS 120B on multiple benchmarks while generating significantly shorter outputs.

The timing is deliberate. Dropping the model mid-GTC, alongside an NVFP4-quantized variant optimized for NVIDIA hardware, signals that Mistral is positioning itself as the default open-source option for enterprise GPU deployments. But does the “one model to rule them all” strategy actually hold up?

How the Architecture Works: 128 Experts, 4 Active

Mistral Small 4 uses a Mixture-of-Experts design with 128 total experts, activating just 4 per token. The total parameter count is 119B, but the effective compute cost per token is roughly 6B parameters (8B if you count embedding and output layers). This is the same architectural philosophy behind earlier Mixtral models, scaled up considerably.
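The expert math above checks out on the back of an envelope. This sketch derives a per-expert size from the article's own figures; the split between shared (attention/embedding) parameters and the expert pool is an illustrative guess, not Mistral's published config:

```python
# Back-of-envelope MoE arithmetic from the figures above.
TOTAL_PARAMS_B = 119      # total parameters, in billions
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 4
SHARED_PARAMS_B = 2       # illustrative guess: layers every token passes through

expert_pool_b = TOTAL_PARAMS_B - SHARED_PARAMS_B
per_expert_b = expert_pool_b / NUM_EXPERTS              # ~0.91B per expert
active_b = SHARED_PARAMS_B + ACTIVE_EXPERTS * per_expert_b

print(f"per expert: ~{per_expert_b:.2f}B")
print(f"active per token: ~{active_b:.1f}B of {TOTAL_PARAMS_B}B total")
```

With 4 of 128 experts routed per token, only about 6B parameters do work on any given forward pass, which is where the "6B active" headline number comes from.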

The practical result: you get a model that punches well above its active parameter weight class. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B — a dense model with nearly 20x more active parameters. On the Artificial Analysis LCR benchmark, it scores 0.72 with just 1.6K characters of output, while competing models like Qwen need 5.8K–6.1K characters to reach comparable performance. That’s 3.5–4x more verbose for similar quality.

The 256K context window is another notable spec. Combined with multimodal input support (text and images), this positions the model for document-heavy enterprise workflows — parsing contracts, analyzing charts, summarizing long reports — without needing to chunk inputs aggressively.

Reasoning on a Dial: The reasoning_effort Parameter

Perhaps the most interesting design choice is the configurable reasoning_effort parameter. Set it to "none" and you get fast, direct responses comparable to Mistral Small 3.2. Crank it to "high" and the model switches into extended chain-of-thought mode equivalent to Magistral’s deep reasoning.

This matters because it eliminates a real operational headache. Previously, teams running Mistral models had to maintain separate deployments for quick instruction-following tasks and complex reasoning problems — different models, different endpoints, different latency profiles. Small 4 collapses that into a single deployment with a runtime toggle.
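In practice the toggle would look something like this. The `reasoning_effort` field name and its values come from the article, but its exact placement in the request body is an assumption — check Mistral's API reference before relying on this shape:

```python
import json

# Hypothetical chat request bodies for Mistral's API. The field name
# `reasoning_effort` is from the release notes; where it sits in the
# payload is an assumption, not a confirmed API detail.
def chat_payload(prompt: str, effort: str = "none") -> dict:
    return {
        "model": "mistral-small-latest",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # "none" = fast replies, "high" = extended CoT
    }

fast = chat_payload("Summarize this contract clause.")             # quick instruction-following
deep = chat_payload("Prove this invariant holds.", effort="high")  # Magistral-style reasoning

print(json.dumps(deep, indent=2))
```

Same endpoint, same deployment — the only thing that changes between a quick lookup and a long chain-of-thought run is one field in the request.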

Mistral reports 40% lower end-to-end latency in latency-optimized configurations and 3x higher throughput compared to Mistral Small 3. For teams running inference at scale, consolidating four model deployments into one while tripling throughput is a meaningful infrastructure simplification.

Where It Stands Against the Competition

The open-source model landscape in early 2026 is crowded. Qwen3 models from Alibaba, Meta’s Llama series, and Google’s Gemma variants all compete for the same developer mindshare. Here’s how Mistral Small 4 stacks up:

vs. GPT-OSS 120B: Mistral Small 4 matches or exceeds GPT-OSS 120B across benchmarks while producing 20% less output — meaning faster responses and lower token costs. The MoE architecture gives it a significant efficiency edge over dense models at similar parameter counts.

vs. Qwen3 series: Qwen models remain strong contenders, especially for multilingual tasks and code generation. But the token efficiency gap is stark — Qwen needs 3.5–4x more output tokens for comparable quality on reasoning tasks, which translates directly into higher API costs and slower responses.

vs. Llama 3.3 70B: Earlier Mistral Small 3 was already competitive with Llama 3.3 70B while being 3x faster on the same hardware. Small 4 extends this advantage with the added multimodal and reasoning capabilities that Llama models don’t natively offer in a single package.

vs. Specialized models: Mistral’s own Devstral Small 2 still scores higher on pure coding benchmarks (68% on SWE-bench Verified), and dedicated vision models may outperform Small 4 on specific image tasks. The tradeoff is clear: Small 4 trades peak specialist performance for breadth and deployment simplicity.

Pricing, Hardware, and How to Run It

API pricing: Through Mistral’s API (mistral-small-latest), the model costs $0.20 per million input tokens and $0.60 per million output tokens. Given its token efficiency — producing shorter outputs for equivalent quality — the effective cost per task may be lower than competing models with cheaper per-token rates but higher verbosity.
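The verbosity effect on cost is easy to quantify. Here the Mistral output rate is the one quoted above; the rival's rate and the token counts are illustrative placeholders, not real pricing:

```python
# Effective cost per task = per-token price x tokens actually emitted.
# Mistral's $0.60/M output rate is from the article; the rival's rate
# and the token counts are made-up illustrative numbers.
MISTRAL_OUT_RATE = 0.60 / 1_000_000   # $ per output token
RIVAL_OUT_RATE = 0.40 / 1_000_000     # hypothetical "cheaper" rival

mistral_tokens = 400                  # concise answer
rival_tokens = 400 * 3.5              # 3.5x more verbose (article's low end)

mistral_cost = mistral_tokens * MISTRAL_OUT_RATE
rival_cost = rival_tokens * RIVAL_OUT_RATE

print(f"Mistral: ${mistral_cost:.6f} per task, rival: ${rival_cost:.6f}")
```

Even against a rival that's a third cheaper per token, the 3.5x verbosity gap makes the concise model cheaper per task.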

Self-hosting requirements: This is where things get real. Mistral recommends 4x NVIDIA HGX H100, 4x HGX H200, or 2x DGX B200 for optimal performance. The minimum viable setup is 4x HGX H100, 2x HGX H200, or 1x DGX B200. This isn’t a model you’ll run on a gaming GPU.

However, the NVFP4-quantized variant (Mistral-Small-4-119B-2603-NVFP4) released on Hugging Face is designed for NVIDIA’s latest hardware with reduced memory footprint. Community members on NVIDIA’s developer forums have noted this as “a pleasant surprise,” with expectations that NVFP4 support and optimization will improve over time.
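A rough weights-only estimate shows why the hardware list looks the way it does, and why NVFP4 (a 4-bit format, so roughly half a byte per parameter) changes the picture. KV cache and activation overhead are deliberately ignored here, so treat these as lower bounds:

```python
# Weights-only VRAM estimate for a 119B-parameter model. Runtime
# overhead (KV cache, activations, CUDA context) is ignored, so the
# real requirement is higher than these figures.
PARAMS = 119e9

def weights_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1024**3

bf16 = weights_gb(2.0)    # ~222 GB: needs several 80 GB-class GPUs
nvfp4 = weights_gb(0.5)   # ~55 GB: a much smaller footprint

print(f"bf16: {bf16:.0f} GB, NVFP4: {nvfp4:.0f} GB")
```

At bf16, the weights alone overflow two H100s before you serve a single token, which is consistent with the 4x H100 minimum above.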

The model is available through vLLM, SGLang, llama.cpp, and Hugging Face Transformers, plus NVIDIA NIM for production deployment and NeMo for fine-tuning.

The Bigger Picture: Consolidation as Strategy

Mistral’s move reflects a broader industry trend: model consolidation. Rather than maintaining a portfolio of specialists, companies are building unified models that handle multiple modalities and task types. OpenAI did this with GPT-4o. Google did it with Gemini. Mistral is now doing it at the open-source tier.

The strategic bet is that developers and enterprises would rather manage one model that’s good at everything than juggle four models that are each great at one thing. Simon Willison, who tested the model on release day, noted the unified approach while observing that its drawing ability still lags: his test prompt produced a “mangled” bicycle and a pelican rendered as “a series of grey curves with a triangular beak.”

That’s an honest data point. Small 4 likely won’t replace dedicated vision or coding models for teams with extreme performance requirements in a single domain. But for the vast majority of use cases — chat, document understanding, code assistance, light reasoning — a single Apache 2.0 model that handles all of them competently is a compelling proposition.

FAQ

What is Mistral Small 4?
Mistral Small 4 is a 119B-parameter Mixture-of-Experts language model from Mistral AI that unifies instruction-following, reasoning, multimodal understanding, and code agent capabilities into a single open-source model. It activates only 6B parameters per token, making it efficient despite its large total size.

Is Mistral Small 4 free to use?
The model weights are released under Apache 2.0, so self-hosting is free. API access through Mistral’s platform costs $0.20 per million input tokens and $0.60 per million output tokens. Free prototyping is available through NVIDIA’s build.nvidia.com.

Can I run Mistral Small 4 locally?
Technically yes, but you need serious hardware. The minimum setup requires multiple enterprise-grade GPUs (4x H100 or 2x H200). The NVFP4-quantized version reduces memory requirements for NVIDIA’s latest hardware, but this is still far beyond consumer-grade setups. For most developers, the API or cloud deployment is the practical path.

How does Mistral Small 4 compare to GPT-4o or Claude?
Mistral Small 4 targets a different segment: it’s an open-source, self-hostable model competing primarily with other open models like Qwen3, Llama, and GPT-OSS. It matches GPT-OSS 120B on benchmarks while being more token-efficient. It isn’t built to go head-to-head with proprietary frontier models like GPT-4o or Claude, though the reasoning_effort parameter brings it closer to that tier on specific tasks.

What’s the difference between Mistral Small 4 and Devstral?
Devstral is Mistral’s specialized coding agent model that scores higher on pure software engineering benchmarks like SWE-bench. Mistral Small 4 incorporates Devstral’s coding capabilities alongside reasoning, vision, and instruction-following in a single model. If coding is your only use case, Devstral may still be the better choice. If you need a generalist, Small 4 is designed to be that.

