StepFun released Step 3.7 Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model aimed squarely at agentic workflows that mix perception, search, and reasoning. About 11B parameters activate per token, which is the lever that keeps a model this large at Flash-speed throughput.
## Tunable reasoning, native vision
The architecture pairs a 196B language backbone with a 1.8B ViT vision encoder for native image understanding, and supports a 256k context window — long enough to swallow a financial filing in one pass. The interesting product knob is three selectable reasoning levels (low, medium, high), so a developer can trade speed and cost against cognitive depth per call rather than picking once at deployment.
## Built for agentic throughput
Step 3.7 Flash claims up to 400 tokens per second and tightens tool-use reliability over Step 3.5 Flash. Targeted workloads include multi-step search loops with cross-source verification, parsing massive documents end-to-end, and running coding agents concurrently in high-throughput pipelines. It’s available through StepFun’s Open Platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal as additional hosts.
## Why it matters
Frontier-class open-weight VLMs that run cheaply are exactly what mid-cost agent products need — closed APIs make per-token economics hard. Tunable reasoning tiers, in particular, are a quietly important UX layer for agents: not every step deserves a deep think.

Leave a comment