Ollama just ripped out its entire inference backend on Mac and replaced it with Apple’s MLX framework. That sentence alone would have been unthinkable a year ago, when Ollama was synonymous with llama.cpp and GGUF quantization. But the numbers from their March 30 blog post make the decision look obvious in hindsight: prefill jumped from 1,154 tokens/sec to 1,810 tokens/sec. Decode went from 58 tokens/sec to 112. Nearly 2x on decode. On the same hardware. Same model. Just a better backend.
The Hacker News post hit 262 points with 116 comments in hours. Reddit’s r/LocalLLaMA lit up. And for good reason — this isn’t an incremental update. It’s Ollama admitting that llama.cpp, the engine that powered local LLM inference for millions of users, was leaving massive performance on the table on Apple Silicon.
Why MLX Changes Everything for Mac Users
The core issue was always architectural. llama.cpp was built for NVIDIA GPUs and cross-platform compatibility. It works on Mac, sure, but it was never designed around Apple’s unified memory architecture — the thing that makes M-series chips special for AI workloads. MLX is. Apple built MLX specifically to exploit unified memory, where the CPU and GPU share the same physical memory pool with zero-copy access. No data shuttling between CPU RAM and GPU VRAM. No bottleneck.
On older chips (M1 through M4), the switch to MLX already delivers meaningful gains. But the real fireworks come from the M5 family, which introduced GPU Neural Accelerators — dedicated matrix-multiplication hardware baked into each GPU core. Apple’s own research shows up to 4x the peak GPU compute for AI workloads compared to M4. MLX knows how to talk to these accelerators through Metal 4 and Apple’s TensorOps framework. llama.cpp does not.
The practical impact: Ollama running Qwen3.5-35B-A3B (a 35-billion-parameter mixture-of-experts model with 3 billion active parameters) on a MacBook Pro now feels genuinely responsive. Decode at 112 tokens/sec means output arrives at roughly the speed of comfortable scanning. Prefill at 1,810 tokens/sec means a 4,000-token prompt processes in about two seconds. This is not cloud API speed, but it’s fast enough that the difference stops mattering for most interactive use cases.
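The latency math behind those claims is easy to check. A rough sketch, using the prefill and decode throughput figures from Ollama’s post (the 500-token reply length is an assumption for illustration):

```python
# Throughput figures from Ollama's MLX benchmark.
PREFILL_TPS = 1810   # prompt-processing tokens/sec
DECODE_TPS = 112     # generation tokens/sec

def time_to_first_token(prompt_tokens: int) -> float:
    """Seconds spent processing the prompt before output begins."""
    return prompt_tokens / PREFILL_TPS

def generation_time(output_tokens: int) -> float:
    """Seconds to generate the reply at steady-state decode speed."""
    return output_tokens / DECODE_TPS

# A 4,000-token prompt with a hypothetical 500-token reply:
ttft = time_to_first_token(4000)   # ~2.2 s
gen = generation_time(500)         # ~4.5 s
print(f"prefill: {ttft:.1f}s, decode: {gen:.1f}s, total: {ttft + gen:.1f}s")
```

For interactive chat, the prefill number matters most: it is the pause before anything appears on screen.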
The NVFP4 Angle Nobody Is Talking About
Buried in the announcement is a detail that deserves its own spotlight: Ollama now supports NVIDIA’s NVFP4 quantization format on Mac via MLX.
NVFP4 is a 4-bit floating point format originally designed for NVIDIA’s Blackwell GPUs. It uses a two-level scaling strategy — a fine-grained E4M3 scaling factor per block of 16 values, plus a per-tensor FP32 scalar. The result is 4-bit weights that maintain accuracy within 1% of FP16 on standard benchmarks (HellaSwag, MMLU, PiQA). That’s not “close enough.” That’s statistically indistinguishable for most practical purposes.
Why does this matter on Mac, where there are no Blackwell GPUs? Because the format itself is good, independent of the hardware it was originally designed for. NVFP4 groups values into blocks of 16 (versus MXFP4’s blocks of 32), which means more localized adaptation to each tensor’s dynamic range. Less quantization error. Better accuracy at the same bit width. MLX can decode these formats natively, so Ollama gets the accuracy benefits of NVFP4 without needing the NVIDIA hardware.
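The block-size effect is easy to demonstrate. Below is a deliberately simplified sketch: symmetric 4-bit absmax quantization with one scale per block, standing in for NVFP4’s per-block E4M3 scale (the real format also applies a per-tensor FP32 scale and an FP4 value grid, both omitted here). The synthetic tensor, with one large outlier per 32 values, is an illustrative assumption, not model data:

```python
def quantize_dequantize(values, block_size):
    """Simplified symmetric 4-bit block quantization: each block of
    `block_size` values shares one absmax-derived scale."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / 7 or 1.0  # 4-bit signed: -7..7
        out.extend(max(-7, min(7, round(v / scale))) * scale for v in block)
    return out

def rms_error(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

# Synthetic tensor: one large outlier per 32 values. With 32-value blocks,
# every block's scale is stretched by the outlier; with 16-value blocks,
# half the blocks keep a fine scale matched to the small weights.
weights = []
for _ in range(128):
    weights.append(100.0)
    weights.extend(0.1 * (j - 15) for j in range(31))

err16 = rms_error(weights, quantize_dequantize(weights, 16))
err32 = rms_error(weights, quantize_dequantize(weights, 32))
print(f"block-16 RMS error: {err16:.3f}")
print(f"block-32 RMS error: {err32:.3f}")
```

Smaller blocks contain the damage an outlier does to its neighbors, which is the intuition behind NVFP4’s 16-value blocks.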
The benchmark from Ollama’s blog used Qwen3.5-35B-A3B quantized to NVFP4, compared against their previous implementation using Q4_K_M (the standard GGUF 4-bit quantization). Same model, same bit width, meaningfully better quality. And they’re already teasing Ollama 0.19 with int4, which pushes numbers even higher — 1,851 tokens/sec prefill and 134 tokens/sec decode.
Ollama vs. LM Studio vs. llama.cpp: Where Things Stand Now
This move reshuffles the competitive landscape for local LLM tools on Mac.
LM Studio has been using MLX as its Mac backend for a while now, and the performance advantage was real — community benchmarks consistently showed 2-3x throughput versus Ollama’s llama.cpp backend. LM Studio also leverages MLX’s memory efficiency, letting users run larger models on the same hardware. That gap is now closed. Ollama with MLX should match or approach LM Studio’s raw inference speed, while keeping Ollama’s terminal-first workflow, OpenAI-compatible API, and open-source codebase.
llama.cpp remains the most configurable option, and it’s still the backbone for cross-platform inference. The GGML ecosystem is massive, and GGUF is the most widely available quantization format on Hugging Face. But on Apple Silicon specifically, MLX is simply better. It knows about unified memory. It knows about Neural Accelerators. It knows about Metal. llama.cpp is bringing a cross-platform hammer to a platform-specific problem.
For the average Mac user running local models, the practical upshot is simple: update Ollama and get roughly 2x faster inference for free. No configuration changes. No model re-downloads. Same ollama run command, dramatically different experience.
What This Means for the M5 Generation
Apple has been quietly building an AI hardware story that doesn’t depend on NVIDIA. The M5 Max ships with up to 128GB of unified memory and a 40-core GPU where every core has a Neural Accelerator. Apple’s machine learning research team published benchmarks showing the Neural Accelerators deliver up to 4x speedup on time-to-first-token compared to M4. Memory bandwidth jumped 28% from M4 to M5 (120 GB/s to 153 GB/s), which directly translates to faster token generation since LLM inference is memory-bandwidth-bound at decode time.
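The bandwidth-bound claim can be sketched with a back-of-envelope ceiling: at decode time, each generated token must stream the active weights through memory roughly once, so throughput cannot exceed bandwidth divided by the resident size of the active weights. The dense 8B / 4-bit configuration below is a hypothetical example, and the model ignores KV-cache traffic, cache reuse, and compute overlap:

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_params_b: float,
                       bits_per_weight: float = 4.0) -> float:
    """Rough upper bound on decode tokens/sec, assuming every token reads
    the active weights from memory once. Illustrative, not a benchmark."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical dense 8B model at 4-bit:
print(f"M4 (120 GB/s): ~{decode_ceiling_tps(120, 8):.0f} tok/s ceiling")
print(f"M5 (153 GB/s): ~{decode_ceiling_tps(153, 8):.0f} tok/s ceiling")
```

The ceiling scales linearly with bandwidth, which is why the 28% jump from M4 to M5 shows up almost directly in decode speed.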
128GB unified memory on M5 Max means you can load a 70B parameter model in 4-bit quantization and still have headroom for your OS and applications. On NVIDIA hardware, that requires a workstation GPU or multiple consumer cards. On a MacBook Pro, you close the lid and take it to a coffee shop.
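A quick footprint estimate shows why the headroom claim holds. The 15% overhead factor is an assumption covering block scales, higher-precision embeddings, and runtime buffers, not a measured figure:

```python
def model_footprint_gb(params_b: float, bits_per_weight: float,
                       overhead: float = 1.15) -> float:
    """Approximate resident size of a quantized model in GB.
    `overhead` is a rough allowance for scales and runtime buffers."""
    return params_b * bits_per_weight / 8 * overhead

resident_gb = model_footprint_gb(70, 4)   # ~40 GB
print(f"70B @ 4-bit: ~{resident_gb:.0f} GB of 128 GB unified memory")
```

Even before counting the KV cache, that leaves well over half the machine’s memory free for the OS and other applications.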
Ollama switching to MLX is the software side catching up to the hardware story Apple has been telling. The M5 chips were designed for this kind of workload. MLX was built to extract it. And now the most popular local LLM tool on Mac actually uses the stack Apple intended.
The local AI inference space has been fragmented for years — different tools, different backends, different quantization formats, all partially optimized for different hardware. Ollama going all-in on MLX for Apple Silicon is a bet that platform-native performance beats cross-platform convenience. Based on the benchmarks, it’s a bet that’s already paying off. The 262-point Hacker News thread isn’t just about speed numbers. It’s about millions of Mac users realizing that their hardware was always capable of this — it just needed the right software to unlock it.