Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Hypura Runs a 31GB Model on a 32GB Mac at 2.2 tok/s — llama.cpp Just OOMs

There’s a frustrating ceiling that every Apple Silicon user running local LLMs hits eventually: your model is slightly too big for your RAM, and everything falls apart. llama.cpp crashes. MLX refuses to load it. The OS starts swapping so aggressively that your entire machine grinds to a halt. You either buy a more expensive Mac or settle for a smaller model.

Hypura, a new open-source inference scheduler written in Rust, takes a different approach. Instead of treating memory as a binary fits-or-doesn’t problem, it distributes model tensors across three storage tiers — GPU, RAM, and NVMe SSD — based on how often each tensor gets accessed during inference. The result: a 31GB Mixtral 8x7B model runs at 2.2 tokens per second on a 32GB Mac Mini, while standard llama.cpp simply crashes with an out-of-memory error on the same hardware.

The project hit Hacker News on March 24, picking up 210 points and 81 comments. It currently sits at 433 GitHub stars under an MIT license, with v0.1.0 released on March 17, 2026.

How the Three-Tier Scheduler Actually Works

The core insight behind Hypura is that not all model weights are accessed equally. In a transformer model, certain tensors — layer norms, embeddings, attention projections — are small and get touched on every single token. Others, like the feedforward weights in dense models or expert weights in Mixture-of-Experts architectures, are massive but accessed less frequently or only partially.

Hypura exploits this by profiling each model’s architecture and assigning tensors to the optimal storage tier:

  • GPU memory: Tiny, high-frequency tensors like norms and embeddings get pinned here. They never leave.
  • RAM: Mid-frequency tensors that benefit from fast access but don’t need GPU-speed bandwidth.
  • NVMe SSD: Large, infrequently accessed tensors — particularly MoE expert weights — get read from disk on demand.
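A minimal sketch of how such a tier assignment could look. This is not Hypura's actual code; the thresholds and field names are invented for illustration, following the size/frequency logic in the list above:

```python
# Illustrative tier assignment: place each tensor by size and per-token
# access frequency. Thresholds (64 MB, 0.25) are made up for the sketch.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_mb: float
    accesses_per_token: float  # fraction of tokens that touch this tensor

def assign_tier(t: Tensor) -> str:
    # Tiny, always-hot tensors (norms, embeddings) are pinned to the GPU.
    if t.accesses_per_token >= 1.0 and t.size_mb < 64:
        return "gpu"
    # Mid-frequency tensors sit in RAM: fast access, no GPU bandwidth needed.
    if t.accesses_per_token >= 0.25:
        return "ram"
    # Large, rarely touched tensors (e.g. MoE experts) stream from NVMe.
    return "nvme"
```

The point of the sketch is that the decision uses model knowledge (what a tensor is and how often inference touches it), which an OS page cache never has.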

This isn’t the same as OS-level virtual memory or swap. The key difference is that Hypura knows the model architecture. It understands that transformer layers execute sequentially, so it can prefetch the next layer’s weights while the current layer is still computing on Metal. Where your OS would generate thousands of random page faults per token, Hypura issues large sequential reads timed to the inference pipeline.
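The prefetch overlap described above can be sketched as a simple double-buffered pipeline: while the current layer computes, a background thread reads the next layer's weights. The function names are hypothetical, not Hypura's API:

```python
# Illustrative prefetch pipeline: layers execute in a fixed order, so the
# read for layer i+1 is issued before the compute for layer i begins.
from concurrent.futures import ThreadPoolExecutor

def run_layers(layers, read_weights, compute):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_weights, layers[0])   # prime the pipeline
        for i, layer in enumerate(layers):
            weights = pending.result()                 # wait for this layer's read
            if i + 1 < len(layers):
                pending = io.submit(read_weights, layers[i + 1])  # prefetch next
            compute(layer, weights)                    # overlaps with the next read
```

Because each read is one large sequential request issued ahead of time, the NVMe sees a predictable stream instead of the random fault storm that mmap-backed paging generates.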

The scheduler automatically selects one of three inference modes based on your hardware and model:

  1. Full-resident — The model fits entirely in GPU + RAM. Zero NVMe overhead. Same speed as llama.cpp.
  2. Expert-streaming — For MoE models. Non-expert tensors (~1GB) stay on GPU. Expert weights stream from NVMe through a pool buffer, with a neuron cache that hits 99.5% of the time.
  3. Dense FFN-streaming — For large dense models. Attention and norm weights (~8GB) live on GPU. Feedforward weights stream from NVMe with scaled prefetch buffers.
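The selection between these three modes reduces to a small decision, sketched below. The `budget_gb` parameter is an invented stand-in for available GPU + RAM after headroom; Hypura's real logic is more involved:

```python
# Illustrative mode selection: model size vs. memory budget, plus
# architecture type. Mirrors the three modes listed above.
def pick_mode(model_gb: float, is_moe: bool, budget_gb: float) -> str:
    if model_gb <= budget_gb:
        return "full-resident"          # zero NVMe overhead
    if is_moe:
        return "expert-streaming"       # stream only selected experts
    return "dense-ffn-streaming"        # stream feedforward weights
```

Run against the article's three benchmark models with a ~28GB budget (32GB minus headroom), this picks exactly the modes shown in the table below.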

The MoE Optimization Is the Real Story

Where Hypura gets genuinely interesting is its handling of Mixture-of-Experts models. In a standard MoE architecture like Mixtral 8x7B, the router activates only 2 of 8 experts per token at each layer, so loading all eight experts' weights for a given token wastes 75% of the bandwidth.

Hypura intercepts the router layer’s output during inference to identify which experts are actually selected, then loads only those expert strides from NVMe. This router interception cuts I/O by roughly 75% compared to loading all experts for every token.
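The router interception idea can be shown in a few lines: take the router's per-expert scores, keep the top-k indices, and load only those strides. This is a conceptual sketch, not Hypura's Rust implementation:

```python
# Illustrative router interception: select the top-k experts by router
# score and load only their weight strides from disk.
def experts_to_load(router_scores: list[float], k: int = 2) -> list[int]:
    top = sorted(range(len(router_scores)), key=lambda i: -router_scores[i])[:k]
    return sorted(top)

# With 2 of 8 experts selected per token, I/O drops by 1 - 2/8 = 75%.
io_reduction = 1 - 2 / 8
```

The saving comes purely from selectivity: the read size per token shrinks from eight expert strides to two.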

On top of that, Hypura tracks expert activation frequency over time. Some experts fire far more often than others — a pattern well-documented in research on models like Mixtral and DeepSeek. The neuron cache keeps hot experts in memory, achieving a 99.5% hit rate according to the project’s benchmarks. In practice, this means most tokens don’t require any NVMe reads at all for MoE models.
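A cache that exploits skewed expert activation can be as simple as an LRU keyed by expert ID; hot experts naturally stay resident. This is a hedged sketch of the idea, not Hypura's actual cache policy, and the capacity and hit-rate numbers depend entirely on the workload:

```python
# Illustrative expert cache: recently used experts stay in memory;
# the least recently used expert is evicted when capacity is exceeded.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def fetch(self, expert_id, load_fn):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.cache[expert_id] = load_fn(expert_id)  # NVMe read happens here
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[expert_id]
```

With the heavily skewed activation patterns reported for Mixtral-style models, most fetches hit the cache and never touch the NVMe at all.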

This is why Mixtral 8x7B at 30.9GB runs at 2.2 tok/s on a 32GB machine — the scheduler only needs to stream a fraction of the model’s total weight per token, and the cache handles the rest.

Benchmarks: What the Numbers Actually Show

All benchmarks were run on the developer’s M1 Max with 32GB of unified memory. Here’s the full picture:

| Model | Size (GGUF) | Inference mode | Hypura speed | llama.cpp |
| --- | --- | --- | --- | --- |
| Qwen 2.5 14B | 8.4 GB | Full-resident | 21 tok/s | ~21 tok/s |
| Mixtral 8x7B | 30.9 GB | Expert-streaming | 2.2 tok/s | OOM crash |
| Llama 3.3 70B | 39.6 GB | Dense FFN-streaming | 0.3 tok/s | OOM crash |

The first row is the important sanity check — when a model fits in memory, Hypura performs identically to llama.cpp. There’s no overhead penalty for using the scheduler on models that don’t need it.

The Mixtral result is the headline number: a model that’s just barely too large for the system runs at a usable 2.2 tok/s. Not fast by any stretch, but workable for batch processing, background summarization, or overnight jobs.

The Llama 70B result at 0.3 tok/s is more of a proof-of-concept. The Hacker News discussion was candid about this — multiple commenters noted that sub-1 tok/s makes interactive use impractical. The developer acknowledged that the 70B dense model “is more of a POC than a functional use case” and pointed to smaller MoE models as the practical sweet spot.

One Hacker News commenter calculated that running overnight at 0.3 tok/s for 12 hours would produce roughly 13,000 tokens — about 10,000 words. Whether that’s useful depends entirely on the task.
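The commenter's arithmetic checks out, assuming a rough 1.3 tokens-per-word ratio (the exact ratio varies by tokenizer and text):

```python
# Overnight throughput at 0.3 tok/s for 12 hours.
tokens = 0.3 * 12 * 3600     # ~12,960 tokens, i.e. the "roughly 13,000"
words = tokens / 1.3         # ~10,000 words at ~1.3 tokens per word
```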

How Hypura Compares to the Competition

The local LLM inference space on Apple Silicon has several established players, and Hypura occupies a specific niche among them.

llama.cpp remains the default choice for most users. It’s mature, well-optimized for Metal, and handles any model that fits in your unified memory with excellent performance. But its behavior when models exceed RAM ranges from severe slowdowns (via mmap) to outright crashes. llama.cpp issue #20852 discusses NVMe-aware inference, but nothing has shipped yet.

MLX (Apple’s own framework) can be 30-50% faster than llama.cpp on Apple Silicon for models that fit in memory, thanks to deeper Metal optimization. But like llama.cpp, it doesn’t handle the “model too big for RAM” scenario gracefully.

ktransformers takes a similar conceptual approach to Hypura — intelligent expert placement for MoE models — but targets NVIDIA GPUs with separate VRAM, not Apple’s unified memory architecture. It offloads expert parameters to CPU/DRAM while keeping projections in VRAM, which solves a different hardware constraint.

Ollama is a user-friendly wrapper around llama.cpp that many people use for local inference. Hypura smartly exposes an Ollama-compatible HTTP API (/api/generate, /api/chat, /api/tags), making it a drop-in replacement. If your workflow already talks to Ollama, switching to Hypura requires zero code changes.
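A sketch of what "drop-in replacement" means in practice: the request shape follows Ollama's documented `/api/generate` contract, so existing client code only needs to point at Hypura's server. The host, port, and model name below are assumptions, not values from the project:

```python
# Illustrative client for an Ollama-compatible /api/generate endpoint.
import json
from urllib import request

def build_payload(prompt: str, model: str = "mixtral:8x7b") -> dict:
    # Non-streaming request body in Ollama's /api/generate format.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    body = json.dumps(build_payload(prompt)).encode()
    req = request.Request(f"{host}/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the payload and response fields match, swapping Ollama for Hypura is a matter of changing the host the client talks to.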

The honest comparison: if your model fits in memory, use llama.cpp or MLX — they’re faster and more mature. Hypura’s value proposition kicks in specifically when you want to run a model that’s 1-30GB larger than your available RAM, particularly MoE models where sparsity makes NVMe streaming viable.

SSD Wear, Build Process, and Practical Details

A common concern with NVMe-based inference is SSD wear. Hypura addresses this directly: it only reads from your SSD during inference. All operations use read-only pread() calls with F_NOCACHE. No writes, no wear on flash cells. The scheduler also includes a safety guardrail that blocks baseline benchmarking when the model exceeds RAM minus a 4GB headroom buffer.
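The read-only access pattern described above looks roughly like this in POSIX terms: open the file read-only, optionally disable the page cache (macOS's `F_NOCACHE`), and use positional reads. A hedged sketch, not Hypura's Rust code:

```python
# Illustrative read-only tensor read: pread() from a read-only fd means
# no writes ever reach the SSD, so no flash wear.
import os
import fcntl

def read_tensor(path: str, offset: int, length: int) -> bytes:
    fd = os.open(path, os.O_RDONLY)              # read-only: no write path exists
    try:
        if hasattr(fcntl, "F_NOCACHE"):          # macOS-only flag; absent on Linux
            fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass the unified buffer cache
        return os.pread(fd, length, offset)      # positional read, no shared seek state
    finally:
        os.close(fd)
```

Bypassing the page cache matters here because Hypura runs its own architecture-aware cache; letting the OS cache the same bytes twice would waste the RAM the model needs.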

Building Hypura requires Rust 1.75+ and CMake. It’s a two-crate Cargo workspace — the main hypura crate handles CLI, placement optimization, the inference engine, and server routes, while hypura-sys provides FFI bindings to a vendored copy of llama.cpp. The codebase is 91.8% Rust.

The developer was transparent on Hacker News about using LLM assistance during development, describing the process as “the Socratic method” — using AI as a thinking partner rather than a code generator. The initial drafts and architecture decisions were human-authored.

What the Community Is Saying

The Hacker News reception was broadly positive but nuanced. The 210 points and 81 comments reflect genuine interest, not hype.

Praise centered on the technical approach — using domain-specific knowledge about transformer execution to beat generic OS paging. Several commenters noted this was a meaningful capability improvement for batch and background workloads.

The main criticism was around the benchmark model selection. Commenters pointed out that Qwen 2.5 and Mixtral are “pretty old models” and pushed for benchmarks on newer architectures like Qwen 3.5 MoE and Kimi K2.5. The developer was receptive, noting that community benchmarking contributions were welcome.

There was also practical skepticism about NVMe random-access performance. One commenter noted that M1 Max NVMe speeds can drop to 65MB/s at queue depth 1 for random reads, which could bottleneck expert loading. The developer’s response pointed to MoE sparsity and sequential prefetching as mitigations, though the debate wasn’t fully resolved.

The r/LocalLLaMA community has been discussing related topics around Apple Silicon inference limits, with an M5 Max performance testing post scoring 119 points and driving broader interest in tools like Hypura.

FAQ

Is Hypura free?
Yes. Hypura is open-source under the MIT license. There’s no paid tier, no telemetry, and no usage restrictions.

What Mac hardware do I need to run Hypura?
Any Apple Silicon Mac (M1 or later). The tool benefits most from machines with fast NVMe storage and at least 16GB of unified memory. The developer’s own benchmarks were run on an M1 Max with 32GB. Higher-end chips like the M4 Max (546 GB/s memory bandwidth) will deliver proportionally faster results — roughly 2x the decode speed of an M4 Pro (273 GB/s).
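The "roughly 2x" figure follows from a standard back-of-envelope model: decode is memory-bandwidth-bound, so tokens per second scale as bandwidth divided by bytes read per token (approximately the active weight size). A simplified estimate, ignoring compute and cache effects:

```python
# Bandwidth-bound decode estimate: tok/s ~ bandwidth / active weights.
def est_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    return bandwidth_gb_s / active_weights_gb

# M4 Max (546 GB/s) vs M4 Pro (273 GB/s) on the same model:
speedup = est_tok_s(546, 39.6) / est_tok_s(273, 39.6)  # -> 2.0
```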

Does Hypura work with all GGUF models?
Hypura loads standard GGUF model files and supports both dense transformer and MoE architectures. MoE models see the biggest benefit due to the expert-streaming optimization. Dense models that exceed RAM will work but at significantly lower speeds (sub-1 tok/s for 70B-class models).

How does Hypura compare to just using swap or mmap?
OS-level swap and mmap treat all memory pages equally — they have no awareness of model structure. This means thousands of small, random page faults per token, each incurring kernel overhead. Hypura knows which tensors are needed next and issues large sequential reads ahead of time, avoiding the fault storm entirely. For MoE models, it further reduces I/O by 75% through selective expert loading.

Will Hypura damage my SSD?
No. Hypura performs read-only operations during inference. SSD wear is caused by write cycles, and Hypura never writes to your storage. All tensor data is read from the GGUF file into RAM/GPU memory pools where computation happens.

