Mamba-3 Scores 4% Higher Than Transformers at 7x the Speed — and It’s Fully Open Source

For nearly a decade, Transformers have been the unchallenged default architecture for language models. Challengers have come and gone — RNNs, LSTMs, various state space experiments — but none managed to beat the Transformer on both quality and speed at the same time. They’d win on efficiency but lose on accuracy, or match performance but only on narrow benchmarks.

Mamba-3 changes that equation. Released on March 17 by a research team spanning Together AI, Carnegie Mellon University, Princeton, and Cartesia AI, this third-generation state space model (SSM) outperforms Transformers by nearly 4% on language modeling benchmarks while running up to 7x faster on long sequences. The paper has been accepted at ICLR 2026, and the code is open source under Apache 2.0.

Whether this signals the start of a real architectural shift or another false alarm depends on details that are worth digging into.

The Benchmark Numbers That Got Everyone’s Attention

At the 1.5-billion-parameter scale, the most advanced MIMO variant of Mamba-3 hit 57.6% average accuracy across downstream evaluation tasks — a 2.2-percentage-point improvement over the standard Transformer baseline. That may sound modest, but at this parameter scale, gains of that size are hard-won.

The speed numbers tell an even more striking story. On an H100 GPU, here’s how the models compare on combined prefill + decode latency at sequence lengths from 512 to 16,384 tokens:

| Model | 512 tokens | 2,048 tokens | 16,384 tokens |
| --- | --- | --- | --- |
| Llama-3.2-1B (vLLM) | 4.45s | 20.37s | 976.50s |
| Mamba-2 | 4.66s | 18.62s | 149.02s |
| Gated DeltaNet | 4.56s | 18.22s | 145.87s |
| Mamba-3 (SISO) | 4.39s | 17.57s | 140.61s |
| Mamba-3 (MIMO r=4) | 4.74s | 18.96s | 151.81s |

At short sequences, the difference is marginal. At 16K tokens, Mamba-3 SISO finishes in 140 seconds while Llama-3.2-1B takes over 16 minutes — roughly 7x slower. The gap widens as sequence length increases because Transformers scale quadratically with input length (O(n²)), while SSMs like Mamba-3 scale linearly (O(n)).
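The asymptotic gap can be sketched with a toy cost model. This counts pairwise interactions only, ignoring constant factors and hardware effects, so it is an illustration of the scaling argument rather than a performance prediction:

```python
# Toy cost model: self-attention touches every pair of tokens (~n^2
# interactions), while an SSM scan touches each token once (~n).

def attention_cost(n: int) -> int:
    return n * n          # O(n^2) pairwise interactions

def ssm_cost(n: int) -> int:
    return n              # O(n) sequential scan

# The relative advantage grows linearly with sequence length:
for n in (512, 2_048, 16_384):
    print(n, attention_cost(n) // ssm_cost(n))
```

This is why the benchmark gap is marginal at 512 tokens but dramatic at 16K: the attention-to-scan cost ratio itself grows with sequence length.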

The MIMO variant trades a small amount of speed for better accuracy — more than 1 percentage point above SISO at the 1B scale — while still comfortably outpacing the Transformer.

Three Technical Bets That Paid Off

Mamba-3 isn’t just Mamba-2 with more training data. The architecture makes three specific changes, each addressing a known weakness of earlier SSMs.

Exponential-trapezoidal discretization. Previous Mamba versions used first-order Euler discretization — essentially the simplest way to convert continuous dynamics into discrete steps. Mamba-3 upgrades to a second-order accurate method. This seemingly small mathematical change produces a more expressive recurrence formula and eliminates the short causal convolution that had been a crutch of earlier architectures. It’s the kind of improvement that doesn’t make headlines but shows up clearly in the benchmark numbers.
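The first-order vs. second-order difference can be seen on a one-line toy problem. The sketch below uses illustrative values and the scalar test equation, not the paper’s actual update rule, to compare both schemes against the exact exponential solution:

```python
import math

# Discretize dx/dt = lam * x, whose exact step is x(t+dt) = exp(lam*dt) * x(t).
lam, dt, x0 = -1.0, 0.1, 1.0

euler = (1 + lam * dt) * x0                           # first-order Euler step
trapz = (1 + lam * dt / 2) / (1 - lam * dt / 2) * x0  # trapezoidal (bilinear) step
exact = math.exp(lam * dt) * x0

print(abs(euler - exact))   # per-step error ~ O(dt^2)
print(abs(trapz - exact))   # per-step error ~ O(dt^3), much smaller
```

Over thousands of recurrence steps, that per-step error gap compounds, which is one plausible reason a “small” numerical upgrade shows up in end-to-end benchmark scores.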

Complex-valued state tracking. By introducing complex numbers into the state update rule — what the researchers call the “RoPE trick” — Mamba-3 gains the ability to solve synthetic reasoning tasks that were flatly impossible for Mamba-2. Earlier SSMs struggled with tasks requiring precise positional awareness, a well-known limitation that critics frequently cited. The connection between complex-valued SSMs and rotary position embeddings (RoPE), a technique widely used in Transformers, provides a theoretical bridge between the two architecture families.
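The connection between complex rotations and RoPE can be verified in a few lines. The sketch below is a simplification with hypothetical values, not the paper’s construction: it shows that rotating a 2-D feature by its absolute position (complex multiplication by exp(i·p·θ)) leaves inner products depending only on the relative position — the property RoPE is built on:

```python
import cmath

theta = 0.5
q, k = 7, 3                      # absolute token positions (toy values)
x = complex(1.0, 2.0)            # a 2-D feature encoded as one complex number

def rot(z: complex, p: int) -> complex:
    """Rotate feature z by position p, i.e. RoPE as complex multiplication."""
    return z * cmath.exp(1j * p * theta)

# Inner product of the two rotated features, as conj(a) * b:
rel = rot(x, q).conjugate() * rot(x, k)
# Algebraically this collapses to |x|^2 * exp(i * (k - q) * theta):
direct = (x.conjugate() * x) * cmath.exp(1j * (k - q) * theta)

print(abs(rel - direct) < 1e-12)   # True: only the offset q - k matters
```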

MIMO (Multi-Input, Multi-Output). Previous Mamba versions used SISO (Single-Input, Single-Output) recurrence. Mamba-3’s MIMO variant processes multiple input and output channels simultaneously, increasing arithmetic intensity during decoding. The cleverness here is that MIMO increases compute per token without proportionally increasing memory usage — and since decoding is memory-bound on modern GPUs, the extra compute is essentially free. The result: higher accuracy at the same wall-clock decoding speed.
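The “essentially free” claim can be made concrete with back-of-the-envelope roofline arithmetic. All numbers below are toy values chosen for illustration, not measurements:

```python
# During batch-1 decoding the SSM state must be streamed from HBM either
# way, so bytes moved are roughly fixed; MIMO with rank r does ~r times
# the multiply-adds on the same bytes, raising arithmetic intensity.

state_bytes = 1_000_000        # bytes streamed per decoded token (toy)
flops_siso = 2_000_000         # multiply-adds per token, SISO (toy)
r = 4                          # MIMO rank, as in the paper's r=4 variant

intensity_siso = flops_siso / state_bytes
intensity_mimo = flops_siso * r / state_bytes

print(intensity_siso, intensity_mimo)   # 2.0 8.0
# As long as intensity stays below the GPU's compute-to-bandwidth ratio,
# the extra FLOPs hide behind the memory traffic.
```

This framing also foreshadows the batching critique discussed later: the argument only holds while decoding stays memory-bound.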

The research was led by student researchers Aakash Lahoti and Kevin Y. Li at CMU, building on the foundational architecture created by Albert Gu (CMU/Cartesia AI) and Tri Dao (Princeton/Together AI) — the same duo behind the original Mamba in 2023.

Why Inference Efficiency Is the Metric That Matters in 2026

Mamba-2 optimized for training speed. Mamba-3 explicitly flips that priority to inference efficiency. This isn’t an arbitrary choice — it reflects where the bottleneck has moved in real-world AI deployment.

With the rise of agentic workflows, chain-of-thought reasoning, and multi-turn applications, models spend far more of their compute budget generating tokens than processing training data. A model that’s 7x faster at inference can serve 7x more users at the same cost, or handle 7x longer contexts within the same latency budget.

At the token-generation level, the numbers are stark: a 1.4B Mamba model produces 1,446 tokens per second, while a 1.3B Transformer manages 344 tokens per second. For applications like coding assistants, real-time agents, or any system making dozens of sequential LLM calls, that difference translates directly into user experience and infrastructure cost.
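Using the throughput figures above, a rough calculation shows what this means end-to-end. The agent workload below (call count and tokens per call) is a hypothetical example, not from the paper:

```python
# Throughput figures quoted in the text:
mamba_tps, transformer_tps = 1446, 344    # tokens per second

# Hypothetical agent: 20 sequential LLM calls, 500 generated tokens each.
calls, tokens_per_call = 20, 500
total_tokens = calls * tokens_per_call

mamba_latency = total_tokens / mamba_tps
transformer_latency = total_tokens / transformer_tps

print(round(mamba_latency, 1), round(transformer_latency, 1))   # ~6.9s vs ~29.1s
```

For a user waiting on an interactive agent, that is the difference between a short pause and an unusable tool.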

Together AI, one of the organizations behind the research, is also the company behind the Together Inference platform. The inference-first design philosophy of Mamba-3 aligns neatly with their business model — faster inference means cheaper API calls.

The Skeptics Raise Valid Points

Mamba-3 hit the Hacker News front page on March 21 with 116 points and 18 comments, and the discussion surfaced some legitimate critiques.

The most pointed criticism came around batch inference. One commenter argued that the speed comparisons are misleading because they measure batch-size-1 inference, which is memory-bound. In production, no API provider runs batch-size-1 — they group requests together to maximize GPU utilization. When you batch, the bottleneck shifts from memory bandwidth to compute, and the increased compute per token in MIMO variants could actually reduce the maximum batch size a GPU can handle.

This is a fair point. The latency benchmarks above show single-request performance, which matters for latency-sensitive applications but doesn’t directly translate to throughput in a high-traffic API setting. The researchers acknowledge this tradeoff: MIMO “requires longer training times but maintains inference speed due to the compute-bound nature of training versus memory-bound decoding operations.”
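The shift the commenter describes can be sketched with toy roofline numbers. Every value below is a rough assumption for illustration, not a measurement of any real model or GPU:

```python
# Weights/state bytes are streamed once per decode step regardless of
# batch size, so batching amortizes memory traffic and eventually pushes
# the bottleneck from bandwidth to compute.

bytes_per_step = 2e9          # bytes streamed per decode step (toy)
flops_per_token = 2e9         # FLOPs per generated token (toy)
peak_flops, bandwidth = 1e15, 3e12   # roughly accelerator-class numbers (toy)

def step_time(batch: int) -> float:
    compute = batch * flops_per_token / peak_flops
    memory = bytes_per_step / bandwidth
    return max(compute, memory)   # whichever resource saturates first

for b in (1, 64, 512):
    print(b, step_time(b))
# Small batches: memory-bound (step time flat as batch grows).
# Large batches: compute-bound -- where MIMO's extra FLOPs stop being free.
```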

Another limitation: Mamba-3 still underperforms Transformers on retrieval-heavy benchmarks. Tasks that require the model to look up and recall specific information from long contexts remain a weak spot for SSMs. The linear-time processing that makes SSMs fast also means they can’t attend to arbitrary positions in the input as freely as Transformers can. Hybrid architectures — combining SSM layers with a few attention layers — may end up being the practical solution here.

Where Mamba-3 Stands in the SSM Landscape

Mamba-3 doesn’t exist in a vacuum. The SSM space has grown crowded, with multiple architectures competing to dethrone Transformers.

RWKV takes an RNN-inspired approach with linear attention. It’s strong in multilingual tasks and has an active open-source community, but its latest versions haven’t matched Mamba-3’s benchmark scores at comparable parameter scales.

xLSTM modernizes the LSTM architecture with exponential gating and matrix memory. It has shown strength in specific domains — xLSTM-UNet outperforms both Transformers and Mamba-based models on medical image segmentation — but hasn’t demonstrated the same breadth of language modeling improvements.

Griffin (from Google DeepMind) combines gated linear recurrences with local attention. It performs well but relies partly on attention mechanisms, making it more of a hybrid than a pure SSM challenger.

Mercury 2 (from Inception Labs) takes a completely different approach — diffusion-based language modeling that generates text in parallel rather than autoregressively. It claims 1,000+ tokens per second but uses a fundamentally different paradigm that makes direct comparison tricky.

What sets Mamba-3 apart is the combination of pure SSM architecture (no attention layers needed), open-source availability under Apache 2.0, and strong institutional backing from Together AI and two major research universities. The GitHub repository (state-spaces/mamba) already has 17,500+ stars, reflecting broad community interest.

The emerging consensus in the research community is that hybrid models — mixing SSM and attention layers — will likely dominate production deployments. But Mamba-3 proves that a pure SSM can compete head-to-head with Transformers, which raises the ceiling for what those hybrids can achieve.

What This Means Going Forward

Mamba-3 is currently demonstrated at the 1.5B parameter scale. The big question is whether these advantages hold as models scale to tens or hundreds of billions of parameters. The original Transformer architecture didn’t prove its dominance at small scales either — it was the scaling behavior that ultimately won out.

The ICLR 2026 acceptance adds academic credibility, and the Apache 2.0 license lowers the barrier for commercial adoption. If Together AI or others train larger Mamba-3 variants and the efficiency gains persist, the default assumption that “Transformer = best” will face its most serious challenge yet.

For now, Mamba-3 at 1.5B is most immediately relevant for edge deployment, mobile applications, and any scenario where inference cost per token is the primary constraint. In those use cases, 7x faster inference isn’t an academic curiosity — it’s a direct cost reduction.

FAQ

Is Mamba-3 free to use?
Yes. Mamba-3 is released under the Apache 2.0 open-source license, which permits both personal and commercial use. The code is available on GitHub.

How does Mamba-3 compare to Llama-3.2-1B?
At the 1.5B parameter scale, Mamba-3 outperforms Meta’s Llama-3.2-1B (a Transformer model) by approximately 4% on language modeling benchmarks. On long-sequence inference (16K tokens), Mamba-3 is roughly 7x faster. However, Llama-3.2 still has an edge on retrieval-heavy tasks where attention mechanisms are beneficial.

Can Mamba-3 replace Transformers in production?
Not as a drop-in replacement today. Mamba-3 has been validated at the 1.5B scale, which is relatively small compared to production models. It also underperforms on retrieval tasks. For many applications, hybrid architectures combining SSM and attention layers are the more practical choice.

What are the main competitors to Mamba-3?
In the SSM space, the main alternatives are RWKV (RNN-inspired linear attention), xLSTM (modernized LSTM), and Griffin (Google DeepMind’s gated linear recurrence). Mercury 2 from Inception Labs takes a different approach using diffusion-based language modeling. Each has different strengths, but Mamba-3 currently leads on combined accuracy and inference speed benchmarks.

Who built Mamba-3?
Mamba-3 was developed by researchers from Carnegie Mellon University, Princeton University, Together AI, and Cartesia AI. The lead researchers are Aakash Lahoti and Kevin Y. Li, building on the original Mamba architecture created by Albert Gu and Tri Dao.

