Top AI Product



PrismML Exits Stealth With $16M and a 1-Bit Model That Rivals Llama 3 at 1/16th the Memory

An 8-billion-parameter model that fits in 1 GB of memory. Not a quantized approximation of a bigger model. Not a research paper that’ll never ship. A production-ready LLM, trained from scratch with 1-bit weights, running at 368 tokens per second on an RTX 4090 and 44 tokens per second on an iPhone. PrismML came out of stealth on March 31st, dropped three open-source models under Apache 2.0, and made a claim that’s either going to age very well or very poorly: 1-bit is the future of edge AI.

The company is a Caltech spinout led by professor Babak Hassibi, with $16.25 million in backing from Khosla Ventures and Cerberus Ventures (the latter founded by Amir Salek, who ran Google’s TPU program). They also got compute grants from Google and Caltech. The pitch is straightforward — if you can train a model natively at 1-bit precision without losing meaningful accuracy, you unlock entirely different deployment economics. Your phone becomes a viable inference device. Your datacenter power bill drops by 80%. And you stop paying the cloud tax every time a user sends a prompt.

That’s a big if. Let’s see what the numbers actually say.

1 GB vs. 16 GB: What 1-Bit Precision Actually Buys You

The flagship model, 1-bit Bonsai 8B, stores every parameter as a single bit. Traditional FP16 models use 16 bits per parameter — so an 8B model needs roughly 16 GB just for the weights. Bonsai 8B needs 1.15 GB. That’s not a minor optimization. That’s the difference between “runs on a server” and “runs on a phone with room to spare.”
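The arithmetic behind those figures is simple enough to check. A quick sketch in decimal gigabytes — note that Bonsai’s reported 1.15 GB sits slightly above the raw 1-bit number, presumably because some components (embeddings, norms, file metadata) stay at higher precision, which is my assumption, not a PrismML disclosure:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: params * bits / 8 bytes, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

weight_footprint_gb(8e9, 16)  # 16.0 -> a standard FP16 8B model
weight_footprint_gb(8e9, 1)   #  1.0 -> the same parameter count at 1 bit
```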

The math behind this matters. When your weights are limited to {-1, 0, +1}, matrix multiplications — the single most expensive operation in LLM inference — collapse into additions and subtractions. No floating-point math. No expensive GPU tensor cores required. This is why 1-bit models are insanely fast on CPUs and edge hardware: the silicon is doing fundamentally simpler operations.
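To make that concrete, here is a toy NumPy sketch — not PrismML’s actual kernel — of a matrix-vector product with ternary weights. Every output element reduces to adding the inputs selected by +1 weights and subtracting those selected by -1 weights; no multiplications ever happen:

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product for W with entries in {-1, 0, +1}.

    Each output element is a sum of added and subtracted inputs --
    multiply-free by construction.
    """
    out = np.empty(W.shape[0])
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, 0, -1],
              [0, 1, 1]])
x = np.array([2.0, 3.0, 5.0])
ternary_matvec(W, x)  # identical result to W @ x, computed with adds/subs only
```

Real 1-bit kernels pack weights into bitmasks and vectorize this, but the principle is the same: the silicon only ever accumulates.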

PrismML didn’t invent this idea. Microsoft’s BitNet research pioneered the concept of training natively at 1-bit, and their BitNet b1.58 2B4T proved it was viable at the 2B scale. But PrismML is the first to push it to 8B parameters and claim commercial viability. That’s a significant jump — going from “interesting research” to “you can actually build products on this.”

PrismML also ships two smaller models. The 4B version (0.57 GB) hits 132 tokens per second on an M4 Pro. The 1.7B version (0.24 GB) runs at 130 tokens per second on an iPhone 17 Pro Max. For context, most full-precision 8B models simply cannot run on a phone at all. Even Apple’s heavily optimized on-device models top out around 20 tokens per second on the A19 Pro for 8B-class inference. PrismML’s 1.7B variant is more than 6x faster on the same class of hardware.

The Benchmark Reality Check

Here’s where things get nuanced, and where PrismML’s story requires honest examination.

On a composite average across IFEval, GSM8K, HumanEval+, BFCL, MuSR, and MMLU-Redux, Bonsai 8B scores 70.5. For comparison, Qwen3 8B scores 79.3 and Olmo3 7B scores 70.9. So Bonsai 8B is roughly competitive with Olmo3 — a solid open-source model — but falls about 9 points short of the current best-in-class 8B model.

That gap matters if you’re comparing apples to apples on a cloud GPU. But PrismML isn’t playing that game. The metric they’re pushing is “intelligence density” — performance per unit of model size. On that measure, Bonsai 8B scores 1.06. Qwen3 8B scores 0.096. That’s more than 10x the intelligence packed per gigabyte. And it makes sense: Bonsai is doing 89% of Qwen3’s job with 7% of the memory.

The throughput numbers tell the real story. On an RTX 4090, Bonsai 8B pushes 368 tokens per second versus 59 for a standard 16-bit 8B model. On an M4 Pro, it’s 131 vs. 16. That’s an 8x speed advantage. On energy efficiency, Bonsai consumes 0.074 milliwatt-hours per token on an M4 Pro, compared to 0.415 for 16-bit — roughly 5.6x more efficient. When you’re running on battery, that’s the difference between an app that kills your phone in an hour and one that runs all day.
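The headline multiples all follow from the figures quoted above. A back-of-the-envelope check, using only the article’s (possibly rounded) numbers:

```python
bonsai_score, qwen_score = 70.5, 79.3   # composite benchmark averages
bonsai_gb, qwen_gb = 1.15, 16.0         # weight memory footprints

capability = bonsai_score / qwen_score  # ~0.89 -> "89% of Qwen3's job"
memory = bonsai_gb / qwen_gb            # ~0.07 -> "7% of the memory"
# score-per-gigabyte ratio, consistent with the "more than 10x" density claim
density = (bonsai_score / bonsai_gb) / (qwen_score / qwen_gb)  # ~12x
energy = 0.415 / 0.074                  # ~5.6x mWh per token on M4 Pro
```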

The honest take: if you’re deploying in the cloud with unlimited GPU budget, Qwen3 8B is the better model. But if your use case involves phones, laptops, embedded devices, robotics, or anything where memory and power are constraints — PrismML is operating in a different league. The model that can actually run on your hardware beats the model that can’t, regardless of what the leaderboard says.

The Competitive Landscape: Microsoft, Apple, and Qualcomm All Want This Space

PrismML isn’t arriving in a vacuum. The on-device AI race is heating up from multiple directions.

Microsoft’s BitNet research laid the theoretical foundation, and their b1.58 2B4T model proved that native 1-bit training works. But Microsoft’s model tops out at 2B parameters and 0.4 GB. It’s a research artifact, not a product platform. PrismML is building on the same principle but at 4x the scale with an explicit commercial roadmap.

On the hardware side, Apple’s A19 Pro and Qualcomm’s Snapdragon 8 Elite Gen 5 both pack around 75 TOPS of neural processing power. Apple’s on-device intelligence can run 8B models, but slowly — and the tooling is still catching up. Google’s AI Edge Gallery has made on-device inference more accessible, but it relies on conventional quantized models that are still memory-hungry.

The open-source ecosystem has been moving fast too. Meta’s Llama 3.2 (1B/3B), Google’s Gemma 3, and Microsoft’s Phi-4 mini all target edge deployment. Ollama’s MLX integration just showed what Apple Silicon can do with optimized inference. And llama.cpp’s GGUF ecosystem has become the de facto standard for running quantized models on consumer hardware.

But all of these approaches share a fundamental limitation: they start with full-precision models and compress them after training. Post-training quantization always loses information. The more you compress, the more you lose. Going from FP16 to 4-bit is roughly a 4x reduction with noticeable quality loss. Going to 1-bit via post-training quantization would be catastrophic.

PrismML’s approach — training natively at 1-bit from scratch — avoids this entirely. The model never had full-precision weights to lose. It learned to think within the constraint. That’s the core technical insight, and it’s why PrismML can claim 14x compression with minimal accuracy loss while post-training quantization tops out around 4x before things fall apart.
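The published technique closest to this is BitNet-style training with a straight-through estimator: the model keeps full-precision latent weights during training, quantizes them on every forward pass, and routes gradients “straight through” the non-differentiable rounding step. PrismML’s exact recipe lives in their whitepaper, so treat this NumPy toy (single weight vector, squared-error loss, fixed 0.5 threshold — all illustrative choices of mine) as a generic sketch of the idea, not their method:

```python
import numpy as np

def quantize_ternary(w: np.ndarray) -> np.ndarray:
    """Snap latent weights to {-1, 0, +1}. (BitNet b1.58 uses an
    absmean-scaled rule; the fixed 0.5 threshold here is illustrative.)"""
    return np.sign(w) * (np.abs(w) > 0.5)

rng = np.random.default_rng(0)
w_latent = rng.normal(size=4)   # full precision; exists only during training
x = rng.normal(size=4)
target = 1.0

for _ in range(200):
    w_q = quantize_ternary(w_latent)   # forward pass sees 1-bit weights
    y = w_q @ x
    grad_y = 2.0 * (y - target)        # d(loss)/dy for squared error
    # Straight-through estimator: pretend d(w_q)/d(w_latent) = 1, so the
    # latent weights absorb the gradient of the quantized forward pass.
    w_latent -= 0.05 * grad_y * x
```

Because the loss is always computed through the quantized weights, the model “learns to think within the constraint,” exactly as described above — there is never a full-precision model to degrade.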

Why Khosla and an Ex-Google TPU Lead Are Betting on This

Vinod Khosla put it bluntly: “AI’s future will be defined by who can deliver the most intelligence per unit of energy and cost.” He’s not wrong. The current trajectory of AI is wildly unsustainable — training runs that cost hundreds of millions of dollars, inference that burns through GPUs at $2-3 per hour, and datacenters that are straining power grids. If 1-bit models can deliver 80-90% of the capability at 5-15% of the resource cost, the economics flip completely.

Amir Salek, the Cerberus Ventures partner who backed PrismML, spent years running Google’s TPU program. His thesis is that power consumption is the “ultimate bottleneck” for AI scaling. It’s not compute. It’s not data. It’s the electricity bill. A model that needs 5x less energy per token doesn’t just save money — it changes what’s physically possible to deploy.

The timing is interesting too. Edge AI hardware has finally caught up to where 1-bit models become practical. The latest NPUs and neural engines in phones and laptops can handle the kinds of operations that 1-bit inference demands. Two years ago, even if you had a 1 GB model, the hardware couldn’t run it fast enough to be useful. Now it can, and PrismML is positioned to exploit that convergence.

Hassibi, PrismML’s CEO, sees this as the beginning, not the destination. “We see 1-bit not as an endpoint, but as a starting point,” he said. The implication is clear: scale up from 8B. If 1-bit Bonsai can hit a 70-point benchmark composite at 8B, what does 1-bit at 30B or 70B look like? A 70B 1-bit model would need about 8-9 GB of memory — roughly what a current 8B FP16 model requires. You could run GPT-4-class intelligence on a laptop. That’s the endgame they’re aiming at.

What This Means If You’re Building On-Device AI

Three things to watch. First, the developer ecosystem. PrismML’s models are on Hugging Face under Apache 2.0 with a published whitepaper. The barrier to trying them is zero. But 1-bit inference requires different optimization than standard GGUF or ONNX pipelines. How quickly PrismML builds out tooling and framework integrations will determine whether developers actually adopt this.

Second, the benchmark gap. That 9-point deficit to Qwen3 8B isn’t catastrophic, but it’s not nothing. For many edge use cases — function calling, summarization, translation, basic reasoning — 70.5 is more than sufficient. For complex multi-step reasoning or coding tasks, it might not be. The question is whether PrismML can close that gap with training improvements, or whether 1-bit precision has a fundamental ceiling.

Third, the competitive response. Microsoft has the BitNet research and the resources to scale it. Meta could train a 1-bit Llama variant tomorrow if they decided to prioritize it. Google controls the TPU infrastructure that PrismML trained on. Any of these companies could decide that 1-bit is strategic and flood the zone with competing models. PrismML’s advantage is speed and focus — they’re all-in on this architecture while the big labs are spread across a dozen priorities.

One thing is hard to argue with: running a real 8B model in 1 GB of memory, at 368 tokens per second on a desktop GPU and 44 on a phone, with Apache 2.0 licensing and $16M in smart money backing? That’s a compelling starting point, no matter where the technology ceiling turns out to be.

