Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


StepFun Step 3.5 Flash Activates Only 11B of 196B Parameters — and Still Matches GPT-5.2

A Chinese AI startup just dropped a 196-billion-parameter model under Apache 2.0, and the kicker is that it uses only 11 billion of those parameters at any given moment. StepFun’s Step 3.5 Flash hit the top of Hacker News this week with a simple claim: it is the most cost-effective model for OpenClaw tasks, beating every other option in a 300-battle evaluation. The reactions have been a mix of genuine excitement and healthy skepticism.

Let’s talk about what’s actually going on here.

196 Billion Parameters, 11 Billion Doing the Work

The architecture is a sparse Mixture of Experts (MoE) with 45 transformer layers, each containing 288 routed experts plus one shared expert that’s always active. For every token the model processes, only the top 8 experts fire. That means you get the knowledge capacity of a 196B model but the inference speed of an 11B one.
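The routing described above can be sketched in a few lines. This is a toy illustration only: the hidden size, router, and expert weights here are made-up assumptions, since StepFun has published the expert counts but not the layer internals.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64          # toy hidden size; the real value is not public
N_ROUTED = 288        # routed experts per layer, per the release
TOP_K = 8             # experts that fire per token

# Toy parameters: a linear router plus one weight matrix per expert.
router_w = rng.standard_normal((D_MODEL, N_ROUTED)) * 0.02
experts = rng.standard_normal((N_ROUTED, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token through the top-8 of 288 experts plus the shared expert."""
    logits = x @ router_w                           # one score per expert
    top = np.argsort(logits)[-TOP_K:]               # indices of the 8 winners
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                        # softmax over winners only
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out + x @ shared_expert                  # shared expert always fires

token = rng.standard_normal(D_MODEL)
y = moe_layer(token)
print(y.shape)  # (64,)
```

The key point the sketch makes concrete: all 288 expert matrices exist in memory (the 196B "knowledge capacity"), but only 8 of them are multiplied per token (the 11B compute cost).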

This isn’t a new trick — DeepSeek pioneered this approach — but StepFun has pushed the efficiency further. The model uses a 3:1 ratio of sliding-window attention to full attention, which cuts memory overhead on long contexts dramatically. Combined with 3-way Multi-Token Prediction (MTP-3), the model generates 100 to 300 tokens per second in typical use, peaking at 350 tok/s on single-stream coding tasks.
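To see why the 3:1 ratio helps at long context, here is a back-of-envelope KV-cache comparison. The split of 45 layers and the 256K context come from the release; the sliding-window size is an illustrative assumption, since StepFun has not published it.

```python
# KV-cache size comparison: hybrid sliding-window/full attention vs all-full.
LAYERS = 45
FULL_LAYERS = LAYERS // 4            # ~1 in 4 layers sees the full context
SW_LAYERS = LAYERS - FULL_LAYERS     # the rest only cache a local window
CONTEXT = 256_000                    # tokens of context
WINDOW = 4_096                       # assumed window size (not published)

def kv_tokens(full_layers: int, sw_layers: int) -> int:
    """Total cached tokens across layers (proportional to KV memory)."""
    return full_layers * CONTEXT + sw_layers * min(WINDOW, CONTEXT)

hybrid = kv_tokens(FULL_LAYERS, SW_LAYERS)
all_full = kv_tokens(LAYERS, 0)
print(f"hybrid cache: {hybrid:,} tokens vs {all_full:,} all-full "
      f"({hybrid / all_full:.0%})")
```

Under these assumptions the hybrid cache holds roughly a quarter of the tokens an all-full-attention stack would, which is the "cuts memory overhead dramatically" claim in concrete terms.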

For context, that’s fast enough to feel like autocomplete, not like waiting for a response.

The INT4 quantized version weighs in at 111.5 GB, which means it actually runs on a Mac Studio with an M4 Max chip. StepFun also confirmed 20 tok/s throughput on NVIDIA DGX Spark with full 256K context support. You can’t run the full-precision version on consumer hardware, obviously, but the fact that a frontier-class model fits on a desktop workstation at all is worth noting.
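The 111.5 GB figure roughly checks out with simple arithmetic. The overhead interpretation below is our assumption, not StepFun's breakdown:

```python
# Sanity check on the published 111.5 GB INT4 checkpoint size.
params = 196e9                       # total parameters
int4_bytes = params * 4 / 8          # 4 bits per parameter
print(f"pure INT4 weights: {int4_bytes / 1e9:.1f} GB")   # 98.0 GB

# The extra ~13.5 GB is plausibly quantization scales plus embeddings
# and other layers kept at higher precision (our assumption).
overhead = 111.5e9 - int4_bytes
print(f"implied overhead: {overhead / 1e9:.1f} GB")      # 13.5 GB
```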

The Benchmark Numbers — and Why They Matter

Here’s where it gets interesting. StepFun reports an overall intelligence average of 81.0 across eight major benchmarks. For comparison: GPT-5.2 scores 82.2, Claude Opus 4.5 lands at 80.6, and Gemini 3.0 Pro hits 80.7. That puts Step 3.5 Flash essentially neck-and-neck with the best closed-source models on the market.

Breaking it down by category:

On math, the model scores 97.3% on AIME 2025 (99.9% with their PaCoRe enhancement mode) and averages 96.2% on HMMT 2025. These are competition-level math problems, and the model is solving them at near-perfect rates.

On coding, it hits 74.4% on SWE-bench Verified (real-world software engineering tasks, not toy problems) and 86.4% on LiveCodeBench-V6. Terminal-Bench 2.0 comes in at 51.0%, which is decent but not class-leading.

On agentic tasks — the stuff that actually matters for the OpenClaw use case — the model scores 88.2 on τ²-Bench and 69.0 on BrowseComp with Context Manager. It was clearly optimized for tool use and multi-step task execution, not just answering questions.

One caveat StepFun themselves acknowledge: the model “relies on longer generation trajectories than Gemini 3.0 Pro to reach comparable quality.” Translation: it sometimes takes more tokens to get to the same answer. That matters for cost calculations, which we’ll get to.

The Cost Story — and Why Hacker News Went Nuts

This is the part that got 80 points and 29 comments on Hacker News.

On OpenRouter, Step 3.5 Flash is priced at $0.10 per million input tokens and $0.30 per million output tokens. DeepSeek V3.2, the obvious comparison point, costs $0.27 per million input and $1.10 per million output. That makes Step 3.5 Flash roughly 2.7x cheaper on input and about 3.7x cheaper on output.

StepFun’s own analysis claims a 1.0x baseline decoding cost on Hopper GPUs at 128K context, compared to 6.0x for DeepSeek V3.2 and a staggering 18.9x for GLM-4.7. If those numbers hold up — and that’s a big “if” since benchmarks are self-reported — this model could become the default backbone for anyone running OpenClaw agents at scale.
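The per-request math at the listed OpenRouter prices is easy to run yourself. The token counts below are a hypothetical agent step, and the 2x output multiplier is an illustrative worst case for the "longer trajectories" caveat, not a measured figure:

```python
# Per-request cost at the listed OpenRouter prices (USD per million tokens).
PRICES = {
    "step-3.5-flash": {"in": 0.10, "out": 0.30},
    "deepseek-v3.2":  {"in": 0.27, "out": 1.10},
}

def request_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Dollar cost of one request with the given token counts."""
    p = PRICES[model]
    return (in_tok * p["in"] + out_tok * p["out"]) / 1e6

# A hypothetical agent step: 20K tokens of context in, 2K tokens out.
step = request_cost("step-3.5-flash", 20_000, 2_000)
deep = request_cost("deepseek-v3.2", 20_000, 2_000)
print(f"${step:.4f} vs ${deep:.4f}")          # $0.0026 vs $0.0076

# Even if Step 3.5 Flash needed double the output tokens to reach the
# same answer, it would still come out well ahead on this step:
long_step = request_cost("step-3.5-flash", 20_000, 4_000)
print(f"${long_step:.4f}")                    # $0.0032
```

That gap compounds quickly for agents making hundreds of tool calls per task.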

The Hacker News thread revealed some interesting real-world data. The OpenClaw cost-effectiveness leaderboard ranked Step 3.5 Flash at number one after 300 battles, with the top three cost-effective models being StepFun 3.5 Flash, Grok 4.1 Fast, and MiniMax M2.7. On raw performance, the top three were Claude Opus 4.6, GPT-5.4, and Claude Sonnet 4.6 — but those cost orders of magnitude more to run.

Not everyone was convinced, though. One developer reported “grave mistakes” including Chinese characters leaking into English outputs and occasional typos — a known issue with bilingual models. Another found it performed poorly compared to Gemini 2.5 Flash on certain tasks. The consensus seems to be: incredible value for the price, but not a replacement for premium models when accuracy is non-negotiable.

StepClaw and the OpenClaw Angle

StepFun didn’t just release a model. They built an entire cloud agent platform around it called StepClaw, launched on March 12, 2026.

StepClaw is built on the OpenClaw framework — the open-source agent platform that’s been dominating GitHub with 247,000 stars. Unlike a chatbot that just answers questions, StepClaw can plan and execute multi-step tasks: invoke tools, run scripts, schedule recurring jobs, integrate with external platforms, and maintain long-term memory across sessions. It’s the difference between asking a question and delegating a task.

The integration makes sense strategically. OpenClaw needs a cheap, fast, reliable backbone model that can handle tool calling without burning through API credits. Step 3.5 Flash, with its native tool-calling support and MCP integration (over 80 Model Context Protocol tools), slots perfectly into that role. If you’re building agents on OpenClaw today, this is probably the most economical option that still delivers frontier-level reasoning.
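As a sketch of what native tool calling looks like from the consumer side: OpenRouter exposes an OpenAI-compatible chat endpoint, so a tool-calling request is just the standard `tools` schema. The model slug and the `get_weather` tool below are illustrative assumptions, not confirmed identifiers — check the model's OpenRouter page for the real slug.

```python
import json

MODEL = "stepfun/step-3.5-flash"   # hypothetical slug, for illustration

# A request payload in the OpenAI-compatible format OpenRouter accepts.
payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "What's the weather in Shanghai?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",           # toy tool for illustration
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# POST this to https://openrouter.ai/api/v1/chat/completions with your API
# key in the Authorization header; a tool-capable model should respond with
# a `tool_calls` entry naming `get_weather` rather than free text.
print(json.dumps(payload)[:80])
```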

For more on the OpenClaw ecosystem, we’ve covered several related projects: ClawWork brought workflow automation to the framework, Claw Code focused on the developer experience, and IronClaw pushed the limits of what agents can do in production environments.

StepFun — The Company Behind the Model

StepFun (阶跃星辰) was founded in 2023 by Jiang Daxin, former Microsoft global vice president and computer scientist. The company raised $717 million in a Series B+ round in January 2026, led by Shanghai Guotou Pioneer Fund, China Life Equity, and Pudong Venture Capital, with participation from Tencent, Qiming Venture Partners, and Five Seasons Capital. That funding round outpaced what both Zhipu AI and MiniMax raised in their Hong Kong IPOs the same month.

The company is now reportedly planning its own Hong Kong IPO, targeting around $500 million.

Within the Chinese AI landscape, StepFun occupies an interesting niche. They’re not going after the consumer chatbot market like Kimi or Doubao. They’re not building a search engine like Perplexity. They’re positioning themselves as the infrastructure layer for agentic AI — the model that powers the agents, not the product users interact with directly. The open-source Apache 2.0 license reinforces that play: get the model everywhere, then monetize the platform and enterprise services.

Step 3.5 Flash already has over 300,000 downloads on Hugging Face and hit number one on OpenRouter’s trending list. Training for Step 4 is already underway.

Where This Fits in the Open-Source Race

The open-source model landscape has gotten extremely competitive in the past few months. DeepSeek V4 pushed the boundaries of what’s possible with open weights. Flash-MoE introduced new efficiency techniques. MiniMax M2.7 showed that smaller companies can compete on quality.

Step 3.5 Flash doesn’t necessarily beat all of them on every benchmark. What it does is hit a specific sweet spot: frontier-level reasoning and agentic capability at the lowest inference cost in the market. For anyone building cost-sensitive agent deployments — and that’s most deployments, honestly — that combination is hard to ignore.

The model has real limitations: bilingual output bleed, longer generation trajectories than some competitors, and self-reported benchmarks that haven't been fully independently verified. But at $0.10 per million input tokens, the risk-reward math looks pretty good for most use cases.

Step 4 is in training. If StepFun can maintain this cost-to-performance ratio while closing the gap on raw quality, the big model providers should be watching closely.

