Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Gemma 4 Scores 89% on AIME With Just 4B Active Parameters — Google’s Open Model Bet Gets Real

Google has been playing defense in the open model race for months. Llama 4 grabbed headlines. Qwen 3.5 dominated coding benchmarks. Gemma 3, despite solid performance, kept losing enterprise deals over one thing that had nothing to do with intelligence: its license.

That changed on April 2. Gemma 4 dropped with four model sizes, vision across the entire lineup, and — the part that actually matters to anyone shipping products — an Apache 2.0 license. No custom clauses. No “Harmful Use” carve-outs that require a call to legal. No MAU limits. The same permissive terms that Qwen, Mistral, and basically every other serious open model already ships with.

Hugging Face co-founder Clément Delangue called it “a huge milestone.” He’s not wrong. But the license is only half the story.

The Numbers That Matter

Gemma 4 comes in four sizes: E2B and E4B for edge devices, a 31B dense model, and a 26B-A4B Mixture-of-Experts variant. The MoE is the star of the show.

The 26B-A4B has 26 billion total parameters but activates only 3.8 billion per forward pass. It uses 128 small experts, routing each token to 8 of them plus 1 always-on shared expert. The result: it runs almost as fast as a 4B model while delivering roughly 97% of the dense 31B’s quality. On LMArena’s text leaderboard, the 31B ranks number 3 among all open models globally, with an Elo around 1452. The MoE sits at number 6.
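The routing arithmetic is easy to sketch. The toy Python snippet below is purely illustrative — the expert counts come from the figures above, but the routing function, variable names, and random logits are invented for the example. It shows why so few of the expert parameters fire per token:

```python
import random

# Figures from the article; everything else here is a simplified sketch.
NUM_EXPERTS = 128   # small routed experts
TOP_K = 8           # routed experts activated per token
SHARED = 1          # always-on shared expert

def route_token(router_logits):
    """Pick the top-k experts for one token from its router logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
active_experts = route_token(logits)

# Only (TOP_K + SHARED) of the 129 expert MLPs run per token. Attention,
# embeddings, and the shared expert are always active, which is why the
# real active count (3.8B of 26B) is higher than this fraction alone.
fraction_active = (TOP_K + SHARED) / (NUM_EXPERTS + SHARED)
print(len(active_experts), round(fraction_active, 3))
```

The design choice this illustrates: per-token compute scales with the active experts, not the total parameter count, which is how a 26B-total model can cost roughly as much per token as a ~4B dense one.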

But the before-and-after comparison with Gemma 3 is where things get wild. On AIME 2026, the 31B scores 89.2%. Gemma 3 27B? 20.8%. On LiveCodeBench v6, the 31B hits 80.0% versus Gemma 3’s 29.1%. On τ2-bench for agentic retail tasks, the 31B scores 86.4% while Gemma 3 managed 6.6%. These aren’t incremental improvements. This is a different class of model wearing the same brand name.

For math, reasoning, and agentic capability, Gemma 4 31B now competes directly with models 20 times its size. The GPQA score lands at 84%, and it handles 256K token context windows across the larger variants.

Why Apache 2.0 Changes Everything

This is the part most technical reviews are burying under benchmark tables, but it’s probably the single biggest factor in whether Gemma 4 actually gets adopted.

Gemma 3’s custom license had usage restrictions that Google could update at will. It included vague “Harmful Use” definitions that enterprise legal teams hated. The practical effect: companies that needed a production-ready open model defaulted to Qwen or Mistral, even when Gemma 3 outperformed them on benchmarks. Performance doesn’t matter if your legal team won’t sign off on the license.

Apache 2.0 eliminates all of that. Full rights to use, modify, and redistribute for any purpose, including commercial applications, with zero royalties. Fine-tuned derivatives can be deployed commercially without ambiguity. No acceptable use policy to interpret.

For enterprise teams that had Gemma on their evaluation list but kept hitting the license wall, the evaluation can finally begin without a call to legal first. VentureBeat put it bluntly: this license change may matter more than the benchmarks.

Edge AI Gets Interesting

The E2B and E4B models are built for phones, Raspberry Pi boards, and Jetson Nano hardware. Google claims up to 4x faster inference than previous Gemma versions and 60% less battery consumption. The E2B is specifically optimized for maximum speed — 3x faster than the E4B — with near-zero latency running fully offline.

Both edge models support native audio input (up to 30 seconds) on top of text and image processing. All four models handle video as frame sequences, up to 60 seconds at one frame per second. Native function calling is built in across the lineup, enabling on-device agentic workflows without cloud round-trips.
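As a concrete sketch of that frame-sequence limit, here is a hypothetical client-side helper. The function name and defaults are ours, derived only from the “60 seconds at one frame per second” figure above — this is not part of any official Gemma SDK:

```python
def frame_timestamps(duration_s, fps_sample=1.0, max_frames=60):
    """Timestamps (in seconds) at which a client might grab frames:
    one per second, capped at max_frames, matching the stated limit."""
    n = min(int(duration_s * fps_sample), max_frames)
    return [i / fps_sample for i in range(n)]

# A 45-second clip yields 45 frames; a 3-minute clip is capped at 60.
print(len(frame_timestamps(45)))    # 45
print(len(frame_timestamps(180)))   # 60
```

In practice a client would decode frames at these timestamps and pass them to the model alongside the text prompt; anything past the cap is simply dropped.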

NVIDIA already has day-one optimization for RTX GPUs. AMD published day-zero support documentation. Ollama, llama.cpp, MLX, vLLM, and Hugging Face Transformers all have Gemma 4 support ready. Google also announced Gemma 4 in Android’s AICore Developer Preview, which means on-device deployment for Android apps is a first-class citizen.

The language coverage is aggressive: pre-trained on 140-plus languages with out-of-the-box support for 35-plus. Community testing shows multilingual quality is genuinely strong — users testing German, Arabic, Vietnamese, and French report that Gemma 4 outperforms Qwen 3.5 in non-English tasks, with one tester calling it “in a tier of its own” for translation quality.

Where Gemma 4 Falls Short

Not everything is rosy. The community found gaps within 24 hours of release.

The biggest complaint: there’s no 12B model. Gemma 3’s 12B was popular with hobbyists and small-team developers — it hit a sweet spot between edge models and the full-size 27B. Gemma 4 jumps from E4B (4.5 billion effective parameters) straight to the 26B MoE, leaving a hole in the lineup with no direct upgrade path.

On coding benchmarks specifically, Qwen 3.5 still leads, with clear margins over both Gemma 4 and Llama 4 on LiveCodeBench and SWE-bench. If your primary use case is code generation, Qwen remains the stronger pick among open models.

Llama 4 Scout has a 10 million token context window compared to Gemma 4’s 256K. For most production use cases 256K is more than enough, but for long-document processing or massive codebases, Meta’s offering has an order-of-magnitude advantage.
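To get a feel for what 256K tokens buys you, here is a back-of-the-envelope check using the common — and very rough — heuristic of about 4 characters per token for English text and code. The helper and its defaults are illustrative, not tied to any particular tokenizer:

```python
def fits_in_context(num_chars, context_tokens=256_000, chars_per_token=4):
    """Very rough check: does a blob of text fit in the window?
    The ~4 chars/token figure is a common heuristic, not a property
    of any specific tokenizer."""
    est_tokens = num_chars / chars_per_token
    return est_tokens <= context_tokens

# ~1 MB of text (~250K estimated tokens) just about fits in 256K...
print(fits_in_context(1_000_000))   # True
# ...but a 10 MB codebase would need a window like Llama 4 Scout's 10M.
print(fits_in_context(10_000_000))  # False
print(fits_in_context(10_000_000, context_tokens=10_000_000))  # True
```

By this estimate, 256K covers a long report or a mid-sized repository, while whole-monorepo workloads are where the order-of-magnitude gap actually bites.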

And some early testers noted that the benchmarks paint a rosier picture than real-world usage. The Register’s coverage was blunt: “Google battles Chinese open weights models with Gemma 4” — framing it as a competitive response rather than a clear victory. That’s probably the most honest read.

The Open Model Landscape in April 2026

Gemma 4’s release makes the open model race a genuine three-way fight. Meta’s Llama 4 leads on context length and has the broadest ecosystem. Alibaba’s Qwen 3.5 dominates coding and has been the enterprise default for teams that need permissive licensing. Google’s Gemma 4 now matches on licensing, leads on reasoning and math benchmarks, and brings the strongest edge deployment story.

The MoE architecture is the dark horse. Getting near-31B quality from 4B active parameters means you can run frontier-class intelligence on hardware that used to be limited to toy models. If the 26B-A4B performs as well in production as benchmarks suggest, it could become the default choice for cost-sensitive deployments where every inference dollar matters.

For anyone building AI products right now, the practical takeaway is simple. If you’re doing on-device or edge AI, Gemma 4 E2B and E4B are best-in-class with no licensing headaches. If you need maximum reasoning and agentic capability in an open model, the 31B dense competes with anything available. If you want the best cost-performance ratio, the 26B MoE running at 4B-model speeds with 31B-model quality is hard to beat.

Google spent the last year watching Qwen and Llama eat its open-model lunch. Gemma 4 is the response — and this time, they brought enough firepower to actually compete.

