Leanstral Uses 6B Active Parameters to Beat Models 100x Its Size at Formal Proofs

Formal verification — the practice of mathematically proving that software or theorems are correct — has long been the domain of specialists willing to wrestle with arcane proof assistants. On March 16, 2026, Mistral AI dropped a model into that world that nobody expected: Leanstral, a 119B-parameter sparse mixture-of-experts model that activates only 6.5B parameters per token, yet outperforms open-source models with 744B parameters on formal proof benchmarks.

The Hacker News thread hit 352 points and 70 comments within hours. But the community reaction wasn’t just applause — it sparked a genuinely interesting debate about what matters more in formal verification: raw accuracy or cost-efficiency.

Why Lean 4 Is Suddenly Everywhere

Before diving into Leanstral itself, it's worth understanding the moment it's arriving in.

Lean 4 is a proof assistant and programming language that lets you express complex mathematical objects (like perfectoid spaces) and software specifications (like properties of Rust programs) in a way that a computer can mechanically verify. If your proof compiles, it’s correct. No edge cases, no flaky tests, no “works on my machine.”
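For readers who haven't seen Lean 4, here is what a machine-checked proof looks like (a trivial example for illustration, not Leanstral output):

```lean
-- If this file compiles, commutativity of addition on Nat is
-- mechanically verified by Lean's kernel; no test suite involved.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

The type checker either accepts the term or rejects the file; there is no partially passing state.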

This guarantee has made Lean 4 increasingly attractive beyond academia. Google DeepMind used Lean 4 as the backbone for AlphaProof, the system that achieved silver-medal performance at the International Mathematical Olympiad. DeepSeek built its Prover-V2 on top of Lean 4, hitting 88.9% on the MiniF2F benchmark. Companies exploring “verified AI” — where AI-generated code comes with mathematical correctness guarantees — are watching this space closely.

The problem? Writing Lean 4 proofs is hard. Really hard. The learning curve is steep, the tooling is sparse, and even experienced developers can spend hours on proofs that feel like they should be simple. That’s exactly the gap Leanstral is designed to fill.

What Leanstral Actually Is (and Isn’t)

Leanstral isn’t a general-purpose coding assistant that happens to know some Lean 4. It’s the first open-source agent built specifically for proof engineering workflows in real codebases.

The technical specs tell an interesting story:

  • Total parameters: 119B across 128 experts
  • Active parameters per token: 6.5B (only 4 experts fire at a time)
  • Context window: 256K tokens
  • License: Apache 2.0
  • Input: Text and images
  • Base: Mistral Small 4 family

The sparse MoE architecture is the key design choice here. Instead of running every parameter for every token (like a dense model), Leanstral routes each token to just 4 of its 128 expert modules. This means you get the knowledge capacity of a 119B model at the inference cost of a ~6.5B model.
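The routing step can be sketched in a few lines. This is an illustrative top-k router, not Leanstral's actual code; the constants mirror the published specs (128 experts, 4 active).

```python
# Illustrative sketch of sparse MoE routing: a router scores all 128
# experts for each token, and only the top 4 actually run.
import math

NUM_EXPERTS = 128
TOP_K = 4

def route(router_logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top-k experts and renormalize their gate weights via softmax."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Each token's output is a weighted sum over just 4 expert outputs,
# so per-token compute scales with 4/128 of the expert parameters.
logits = [0.0] * NUM_EXPERTS
logits[7], logits[42], logits[3], logits[99] = 2.0, 1.5, 1.0, 0.5
selected = route(logits)
print([i for i, _ in selected])  # -> [7, 42, 3, 99]
```

The gate weights always sum to 1, so swapping experts in and out doesn't change the output's scale, only which knowledge gets consulted.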

Mistral also trained Leanstral specifically with tool-calling capabilities for lean-lsp-mcp — the Model Context Protocol server for Lean’s language server. This means the agent can interact with Lean’s compiler directly: checking types, running tactics, inspecting error messages, and iterating on proofs in a loop. It’s not just generating text that looks like Lean code — it’s operating within the actual Lean development environment.

Three deployment paths are available: the /leanstral command inside Mistral Vibe (their CLI tool), a free API endpoint (labs-leanstral-2603) for community feedback, or self-hosted deployment using vLLM with the Apache 2.0 weights from Hugging Face.

The Numbers That Started a Debate

Mistral introduced FLTEval, a benchmark designed to measure proof engineering capability in realistic repository settings (not just isolated math competition problems). Here’s how Leanstral stacks up:

Against open-source models:

Model                  Parameters           FLTEval Score
GLM5                   744B                 ~16.6
Kimi-K2.5              1T                   ~20.1
Leanstral (pass@1)     119B (6.5B active)   21.9
Qwen3.5 (pass@4)       397B                 25.4
Leanstral (pass@2)     119B (6.5B active)   26.3
Leanstral (pass@16)    119B (6.5B active)   31.9

Leanstral at pass@1 already beats GLM5-744B and Kimi-K2.5-1T, and at pass@2 (26.3) it edges out Qwen3.5-397B at pass@4 (25.4). These are models with several times Leanstral's total parameter count and far more active compute per token.

The cost picture is where it gets wild:

Model                  Cost per FLTEval Run   Score
Leanstral (pass@2)     $36                    26.3
Haiku                  $184                   23.0
Sonnet                 $549                   23.7
Leanstral (pass@16)    $290                   31.9
Opus                   $1,650                 39.6

Leanstral at pass@2 costs $36 and scores 26.3 — beating Haiku ($184, score 23.0) and Sonnet ($549, score 23.7) at a fraction of the price. Even at pass@16 ($290), it undercuts Opus by over 80% while reaching 31.9.

But here’s the rub: Opus still wins on raw score at 39.6.

This is where the Hacker News debate got heated. User andai put it bluntly: “If you’re optimizing for correctness, why would cheaper with worse results be relevant?” User jasonjmcghee noted that Leanstral is “specifically trained on this task and significantly underperforms Opus.”

The counter-argument is equally valid: in formal verification, a proof either checks or it doesn't. A model that gets you 80% of Opus's performance at 18% of the cost lets teams run dramatically more proof attempts. And since Lean's compiler acts as the ultimate judge, failed attempts cost only money, never correctness: you can simply try more times and keep whatever compiles.
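The "just try more times" argument follows from basic probability. Assuming independent attempts that each succeed with probability p (an illustrative number, not a measured Leanstral success rate), the chance that at least one proof compiles grows quickly with k:

```python
# Why cheap attempts matter when the compiler is the judge:
# P(at least one of k independent attempts succeeds) = 1 - (1 - p)^k.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

p = 0.20  # assumed per-attempt success probability (illustrative)
for k in (1, 2, 16):
    print(k, round(pass_at_k(p, k), 3))
# k=1  -> 0.2
# k=2  -> 0.36
# k=16 -> 0.972
```

At p = 0.2, sixteen cheap attempts push the success probability above 97%, which is why a low per-run price changes the calculus even when single-shot accuracy trails Opus.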

igravious raised another concern: the linear-scaling claims in Mistral's blog post may be overstated. While competitors like Qwen and Kimi show roughly linear improvement with more compute, Leanstral's scaling curve appears to flatten earlier.

How It Compares to the Competition

The formal proof AI landscape has gotten crowded fast:

AlphaProof (Google DeepMind) remains the elephant in the room. It achieved IMO silver-medal performance using Lean 4 — but it’s completely closed-source with no public access. For anyone outside Google, it might as well not exist.

DeepSeek-Prover-V2 is the strongest open-source competitor. Built on DeepSeek-V3’s 671B-parameter base, it scores 88.9% on MiniF2F-test — a different benchmark than FLTEval, making direct comparison tricky. DeepSeek-Prover focuses on competition-style theorem proving (isolated problems), while Leanstral targets proof engineering in real repositories. Different tools for different jobs, though there’s meaningful overlap.

General-purpose models (Claude, GPT) can handle Lean 4 proofs but weren’t trained for it. As the benchmark data shows, even Opus — the strongest performer — costs dramatically more per proof attempt. For teams doing occasional proof work, using a general model makes sense. For teams doing it all day, the economics don’t work.

Leanstral’s niche is the intersection of three things: open-source availability, proof-engineering-specific training, and cost-efficiency. No other model currently occupies that exact spot. Whether the niche is big enough to matter depends on how fast formal verification adoption grows — and there are signs it’s accelerating, driven by safety-critical AI systems, smart contract verification, and regulatory pressure for provably correct software.

The Bigger Picture: Verified Vibe Coding

Mistral is framing Leanstral as a foundation for “trustworthy vibe coding” — the idea that AI-generated code should come with mathematical guarantees, not just vibes and prayers.

The workflow looks like this: instead of writing tests and hoping you covered the edge cases, you write a formal specification of what the code should do, then let Leanstral generate the proof that your implementation matches the spec. If the proof compiles in Lean 4’s type checker, you have a mathematical guarantee of correctness.
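As a toy illustration of that spec-then-prove split (a hypothetical example, not Leanstral output): the theorem statements are the human-written spec, and the proof scripts after `by` are what an agent like Leanstral would be asked to find.

```lean
-- Implementation: a hand-written max function.
def myMax (a b : Nat) : Nat := if a ≥ b then a else b

-- Specification: the result is never smaller than either input.
-- The tactic proofs are the part the agent searches for;
-- Lean's kernel then checks them mechanically.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax; split <;> omega

theorem myMax_ge_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax; split <;> omega
```

The spec stays short and human-auditable; the proof, however long, is checked by the kernel rather than trusted.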

As HN commenter cadamsdotcom pointed out, this approach creates “zero tokens in the context when the code is correct” — once a proof is established, you don’t need to carry around tests, documentation, or explanations. The proof is the documentation.

But TimTheTinker raised the natural follow-up: "How do you verify a Lean 4 spec is correct without human review?" User justboy1987 countered that specs are "10-50x shorter" than implementations, and that Lean's kernel is "small, trusted (~10,000 lines)" and has been scrutinized by the programming language research community for years.

This is the real value proposition for Leanstral: not replacing human judgment in writing specs, but dramatically reducing the labor cost of proving that implementations match those specs.

FAQ

Is Leanstral free to use?

Yes, in multiple ways. The model weights are Apache 2.0 licensed, meaning you can download and self-host them for any purpose, including commercial use. Mistral also offers a free API endpoint (labs-leanstral-2603) for a limited time to gather community feedback. Within Mistral Vibe, you can access it via the /leanstral command.

How does Leanstral compare to DeepSeek-Prover-V2?

They target different use cases. DeepSeek-Prover-V2 (671B parameters) excels at competition-style theorem proving, scoring 88.9% on MiniF2F-test. Leanstral (119B, 6.5B active) is optimized for proof engineering in real codebases — working with existing repositories, handling imports, navigating library dependencies. DeepSeek-Prover is a bigger, more powerful model; Leanstral is a smaller, more specialized one. They use different benchmarks, making direct comparison difficult.

Can Leanstral replace general-purpose coding assistants?

No. Leanstral is specialized for Lean 4 formal verification. It won’t help you write Python scripts or debug JavaScript. For teams that need a coding assistant that also handles some Lean 4 work, a general model like Claude or GPT is more practical. Leanstral is for teams where formal proofs are a core part of the workflow.

What hardware do I need to self-host Leanstral?

Despite activating only ~6.5B parameters per inference step, the sparse MoE architecture still requires enough GPU memory to hold all 119B weights (roughly 240 GB in 16-bit precision). Mistral recommends using vLLM with --tensor-parallel-size 4 and the Flash Attention MLA backend. Four high-end GPUs (like A100 80GB or H100) should be sufficient, though exact requirements depend on your batch size and context length settings.
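A back-of-the-envelope sizing check (weights only; the KV cache, activations, and vLLM's own overhead come on top, so treat these as lower bounds):

```python
# Rough weight-memory sizing for self-hosting a 119B-parameter model.
# 1B parameters at 1 byte each occupy ~1 GB.
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    return total_params_b * bytes_per_param

print(weight_memory_gb(119, 2))  # FP16/BF16: 238 GB -> fits across 4x 80GB GPUs
print(weight_memory_gb(119, 1))  # 8-bit quantized: 119 GB
```

This is why four 80 GB cards are the suggested floor at 16-bit: 4 × 80 GB = 320 GB leaves headroom over the ~238 GB of weights for the KV cache at long contexts.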

Who should care about Leanstral?

Three groups: researchers working on formalized mathematics, teams building safety-critical software that requires formal verification (aerospace, financial systems, smart contracts), and AI labs exploring verified code generation. If you’ve never heard of Lean 4 before reading this article, Leanstral probably isn’t for you — yet.
