There’s a number that should make every AI engineer stop and think: 80,000 to 1. That’s the token-to-parameter ratio of Liquid AI’s new LFM2.5-350M — a model with just 350 million parameters that was trained on 28 trillion tokens. For context, most models see maybe 20 to 100 tokens per parameter during training. Liquid AI fed this thing 80,000 tokens for every single parameter it has. The result is a tiny model that punches absurdly above its weight class, beating competitors with twice as many parameters on benchmarks that matter for real-world deployment.
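The headline ratio is simple arithmetic:

```python
# Token-to-parameter ratio for LFM2.5-350M (figures from the article).
params = 350e6   # 350 million parameters
tokens = 28e12   # 28 trillion training tokens

ratio = tokens / params
print(f"{ratio:,.0f} tokens per parameter")  # 80,000

# For comparison, Chinchilla-style "compute optimal" training lands
# around 20 tokens per parameter; most production models see 20-100.
```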
Released on March 31, LFM2.5-350M landed with a clear thesis: small models don’t have to be dumb. They just need to be trained differently. And the data backs it up.
Not a Transformer, Not an RNN — Something In Between
The secret sauce here isn’t just massive data. It’s architecture. LFM2.5-350M doesn’t use the standard Transformer backbone that powers most language models. Instead, it’s built on what Liquid AI calls Linear Input-Varying Systems, or LIVs — a framework that treats convolutions, recurrences, and attention as special cases of a single unified operator whose weights are generated on-the-fly from the input itself.
The practical implementation is a hybrid stack: 10 Double-Gated LIV Convolution Blocks handle the bulk of sequence processing, while 6 Grouped Query Attention blocks handle precise retrieval and long-range context. The LIV blocks give you the speed of an RNN with the expressiveness of attention, and the sparse use of GQA blocks means the KV cache overhead stays tiny.
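Liquid AI hasn't released a reference implementation alongside the announcement, so here's a toy NumPy sketch of the core idea: a causal depthwise convolution whose gates are generated from the input itself, which is what makes the effective linear operator "input-varying." The shapes, the two-sigmoid gating scheme, and every weight below are illustrative assumptions, not the actual LFM2.5 block.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def double_gated_causal_conv(x, W_in, W_out, kernel, W_gate_pre, W_gate_post):
    """Toy input-varying gated conv block (illustrative, not Liquid AI's code).

    x: (seq_len, d_model). Both gates are computed from x, so the linear
    operator effectively applied to the sequence depends on the input.
    """
    seq_len, d = x.shape
    k = kernel.shape[0]

    gate_pre = sigmoid(x @ W_gate_pre)   # (seq_len, d), input-dependent gate
    h = (x @ W_in) * gate_pre            # gated input projection

    # Depthwise causal convolution: position t sees only steps t-k+1 .. t.
    padded = np.vstack([np.zeros((k - 1, d)), h])
    conv = np.zeros_like(h)
    for t in range(seq_len):
        conv[t] = (padded[t:t + k] * kernel).sum(axis=0)

    gate_post = sigmoid(x @ W_gate_post)  # second gate, also input-dependent
    return (conv * gate_post) @ W_out

rng = np.random.default_rng(0)
d, L, k = 16, 8, 4
x = rng.standard_normal((L, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
kernel = rng.standard_normal((k, d)) * 0.1
y = double_gated_causal_conv(x, W[0], W[1], kernel, W[2], W[3])
print(y.shape)  # (8, 16)
```

The important property is that the block's state per step is a fixed-size window of size k, regardless of sequence length, unlike attention's growing KV cache.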
This isn’t just theoretical elegance. It directly enables two things that pure Transformers at this scale can’t match: constant-state memory (so the model doesn’t blow up RAM during inference) and massive parallelization during training (so you can actually push 28 trillion tokens through a 350M model without it taking forever). The model supports a 32,768-token context window — respectable for anything at this parameter count.
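To see why using attention sparsely matters, compare KV-cache sizes at full context. The 6 GQA blocks come from the article; the head dimension, KV-head count, fp16 storage, and the 16-layer pure-Transformer baseline are assumptions for illustration only.

```python
# Rough KV-cache comparison at the 32,768-token context window.
# Conv blocks keep constant-size state, so only attention layers cache K/V.
ctx = 32_768
head_dim = 64      # assumed
kv_heads = 8       # assumed GQA KV-head count
bytes_per = 2      # fp16

per_layer = ctx * kv_heads * head_dim * 2 * bytes_per  # K and V tensors

hybrid = 6 * per_layer    # only the 6 GQA blocks pay for a cache
pure = 16 * per_layer     # hypothetical all-attention model of similar depth

print(f"hybrid: {hybrid / 2**20:.0f} MiB, pure: {pure / 2**20:.0f} MiB")
```

Under these assumed dimensions each attention layer's cache is 64 MiB at full context, so keeping only 6 such layers cuts inference memory growth by more than half versus an all-attention stack.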
Liquid AI spun out of MIT CSAIL in 2023, founded by researchers Ramin Hasani, Mathias Lechner, Alexander Amini, and Daniela Rus. Their earlier work on “liquid neural networks” — inspired by the neural circuits of the C. elegans worm — laid the groundwork for this architecture. The company has raised $297 million to date, including a $250 million Series A led by AMD at a $2.3 billion valuation. AMD’s involvement isn’t just financial — their Ryzen AI chips are one of the primary deployment targets for these models.
The Benchmarks: Beating Models with Twice the Parameters
Numbers tell the story better than adjectives. Here’s how LFM2.5-350M stacks up against its closest competitors — IBM’s Granite 4.0-H-350M (same parameter count, hybrid SSM architecture), Alibaba’s Qwen3.5-0.8B (more than twice the size at 800M parameters), and Google’s Gemma 3 1B (nearly three times the size):
On GPQA Diamond, a graduate-level reasoning benchmark, LFM2.5-350M scores 30.64. Granite 4.0-H-350M gets 22.32 — a 37% gap between two models of the same size. Qwen3.5-0.8B, despite having more than double the parameters, only reaches 27.41.
On IFEval, which measures how well a model follows complex instructions, LFM2.5-350M hits 76.96. That’s a blowout. Qwen3.5-0.8B manages 59.94. Gemma 3 1B gets 63.49. Even Granite’s hybrid variant only reaches 61.27. A 350M model outscoring an 800M model on instruction following by 17 points is not a rounding error — that’s a fundamental gap.
IFBench tells a similar story: LFM2.5-350M at 40.69 versus Qwen3.5-0.8B at 22.87 and Granite 4.0-H-350M at 17.22. That’s nearly double the next-best score, and well over double its same-size Granite rival.
But the really dramatic numbers show up on applied tasks — the stuff that actually matters for production. On CaseReportBench (structured data extraction from medical case reports), LFM2.5-350M scores 32.45. Its predecessor LFM2-350M got 11.67. Granite 4.0-H-350M? 0.84. Not a typo — less than one percent. On BFCLv3 (function calling), LFM2.5-350M leads with 44.11 versus Granite’s 39.58 and its own predecessor’s 22.95. And on the Tau-bench telecom and retail scenarios, which test real-world tool use, LFM2.5-350M nearly doubles its predecessor’s scores.
The one benchmark where Qwen3.5-0.8B clearly wins is MMLU-Pro — 37.42 versus LFM2.5-350M’s 20.01. That’s the pure knowledge benchmark, and more parameters means more memorized knowledge. Fair enough. Liquid AI is explicit about this: LFM2.5-350M is not designed for knowledge-intensive tasks, math, code generation, or creative writing. It’s designed for data extraction, tool use, and structured output. And at those tasks, nothing at this scale comes close.
313 Tokens Per Second on a CPU — The Edge Deployment Story
Raw benchmarks are one thing. What makes LFM2.5-350M genuinely interesting for production is where and how fast it runs.
On an NVIDIA H100 GPU at high concurrency, the model pushes 40,400 output tokens per second. That translates to nearly 3.5 billion tokens per day on a single GPU. For large-scale data processing pipelines — scraping, extraction, classification — that throughput-to-cost ratio is hard to beat.
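The daily figure follows directly from the per-second rate:

```python
# Sanity check on the daily-throughput claim.
tok_per_sec = 40_400
per_day = tok_per_sec * 86_400       # seconds in a day
print(f"{per_day:,} tokens/day")     # 3,490,560,000, i.e. about 3.5 billion
```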
But the edge numbers are where things get really compelling. On an AMD Ryzen AI Max 395+ CPU, LFM2.5-350M decodes at 313 tokens per second while using just 434MB of memory. On a Qualcomm Snapdragon Gen4 — the kind of chip in a flagship smartphone — it hits 188 tok/s. Even an iPhone 13 Mini runs it at 88 tok/s with only 56MB of memory. A Raspberry Pi 5 manages 30 tok/s.
The memory footprint is the key enabler. Quantized to 4-bit, the model fits under 500MB. On a Snapdragon GPU with RunAnywhere Q4 quantization, peak memory is just 81MB. That’s small enough to run alongside other apps on a phone without anyone noticing.
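A back-of-envelope check on that footprint. The group-wise quantization overhead below assumes a llama.cpp-style 4-bit scheme with 32-weight groups; actual formats vary, and embeddings plus runtime buffers (often kept at higher precision) account for the rest of the sub-500MB total.

```python
# Rough 4-bit weight footprint for a 350M-parameter model.
params = 350e6
weight_mb = params * 4 / 8 / 2**20          # 4 bits per weight
# Assumed overhead: one fp16 scale + one fp16 offset per 32-weight group.
overhead_mb = (params / 32) * 2 * 2 / 2**20

print(f"~{weight_mb + overhead_mb:.0f} MB of quantized weights")
```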
Liquid AI has lined up day-one support across a breadth of inference frameworks that’s unusual for a model this size: llama.cpp, MLX, vLLM, SGLang, ONNX, and OpenVINO. They’ve also partnered with hardware-specific optimizers — AMD, Qualcomm, Intel, Apple Silicon, and startups like Zetic, RunAnywhere, and Mirai are all in the ecosystem. LM Studio users can grab the model directly.
The production validation came from Distil Labs, which fine-tuned LFM2.5-350M for multi-turn tool-calling scenarios across smart home, banking, and terminal workflows. After fine-tuning, the model achieved 96-98% tool call equivalence with a 120-billion-parameter teacher model. A 350M model matching a 120B model on tool calling — that’s the kind of result that changes deployment architectures.
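Distil Labs hasn't published its exact metric, but a minimal version of "tool call equivalence" (same function name, same arguments, compared order-insensitively) might look like this. The function names and argument schemas below are invented for illustration.

```python
import json

def calls_equivalent(call_a: dict, call_b: dict) -> bool:
    """True if two tool calls name the same function with identical arguments.
    Argument dicts are compared order-insensitively via canonical JSON."""
    return (call_a["name"] == call_b["name"] and
            json.dumps(call_a["arguments"], sort_keys=True) ==
            json.dumps(call_b["arguments"], sort_keys=True))

def equivalence_rate(student_calls, teacher_calls):
    """Fraction of teacher tool calls the student model reproduced exactly."""
    matches = sum(calls_equivalent(s, t)
                  for s, t in zip(student_calls, teacher_calls))
    return matches / len(teacher_calls)

# Hypothetical smart-home scenario: teacher = large model, student = LFM2.5.
teacher = [{"name": "set_thermostat", "arguments": {"room": "living", "temp": 21}},
           {"name": "lock_door",      "arguments": {"door": "front"}}]
student = [{"name": "set_thermostat", "arguments": {"temp": 21, "room": "living"}},
           {"name": "lock_door",      "arguments": {"door": "back"}}]

print(equivalence_rate(student, teacher))  # 0.5
```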
Where LFM2.5-350M Fits in the Small Model Arms Race
The small language model space in 2026 is getting crowded and competitive. IBM’s Granite 4.0 family — which VentureBeat dubbed “Western Qwen” — uses a similar hybrid Mamba/Transformer architecture and has been aggressively expanding. Alibaba’s Qwen3.5 series ranges from 0.8B to 9B parameters, all natively multimodal and Apache 2.0 licensed. Google’s Gemma 3 series and Microsoft’s Phi-4 family both target the sub-4B range.
LFM2.5-350M carves out a very specific niche that none of these competitors occupy. It’s not trying to be a general-purpose chatbot. It’s not multimodal. It doesn’t do reasoning chains or write poetry. What it does is run reliable data extraction and function calling at the absolute minimum viable parameter count, fast enough and small enough to embed in places where even a 1B model is too heavy.
Think IoT devices processing sensor data. Think on-device agents in mobile apps that need to parse APIs and call tools without a round trip to the cloud. Think high-throughput pipelines where you’re processing millions of documents and every dollar of GPU time matters. At 40K tok/s on an H100, you can run data extraction across a massive corpus for a fraction of what a larger model would cost.
The jump from LFM2 to LFM2.5 is worth noting too. By tripling the training data from 10T to 28T tokens and adding multi-stage reinforcement learning, Liquid AI roughly doubled the model’s scores on function calling and nearly tripled its data extraction performance. Same 350M parameters, same architecture — just more and better training. That’s a strong argument for the “compute optimal training” thesis: we’re nowhere near the ceiling of what small models can do if you train them hard enough and smart enough.
The model weights are open on Hugging Face under the LiquidAI organization, along with base model weights for anyone who wants to fine-tune. The LEAP platform handles customization and deployment for enterprise users.
The Real Lesson: Parameters Are Overrated
For the past three years, the AI industry has been locked in a parameter count arms race. Bigger is better. Trillion-parameter models are the frontier. And at the capabilities frontier, that’s largely true — you need scale for reasoning, creativity, and broad knowledge.
But for the vast majority of production AI workloads, you don’t need reasoning. You need reliability. You need a model that can parse a JSON schema, call the right API, extract the right fields from a document, and do it a billion times without hallucinating. LFM2.5-350M is purpose-built for exactly that class of problem, and it does it in 81MB of RAM on a phone chip.
The 80,000:1 token-to-parameter ratio is the number that matters here. It suggests we’ve been under-training small models by orders of magnitude. If you can make a 350M model beat an 800M model by feeding it 28 trillion tokens, the question isn’t whether to build bigger models — it’s whether we’ve been wasting compute scaling parameters when we should have been scaling data instead.