A 397-billion-parameter model running at 4.4 tokens per second on a laptop with 48GB of RAM. No cloud API. No multi-GPU server. Just a MacBook Pro, an NVMe SSD, and about 7,000 lines of C and Metal code that nobody wrote by hand.
Flash-MoE landed on Hacker News and GitHub Trending this week, and the local LLM community is paying attention — because it tackles the single biggest constraint in consumer-grade AI inference: memory.
The Core Trick: Treat Your SSD Like Extended RAM
The idea behind Flash-MoE isn’t new. Apple published a paper called “LLM in a Flash” back in December 2023, describing how to run models larger than available DRAM by streaming weights from flash storage on demand. The paper showed 4-5x speedups on CPU and 20-25x on GPU compared to naive loading, and hinted at a future where phones and laptops could run models that technically don’t fit in memory.
That paper mostly targeted dense models and FFN sparsity. Dan Woods — the developer behind Flash-MoE — realized the technique maps even more naturally onto Mixture-of-Experts (MoE) architectures. Here’s why: Qwen3.5-397B-A17B has 512 experts per layer but only activates a handful per token. The vast majority of the model’s 209GB of weights sit idle during any given forward pass. Instead of loading everything into RAM, Flash-MoE reads only the active expert weights from NVMe SSD, processes them on the GPU via Metal compute shaders, and moves on.
The result: 5.5GB of resident memory for the non-expert components (embeddings, routing matrices, attention layers), while expert weights stream in at up to 17.5 GB/s from the SSD. The OS page cache handles the rest, maintaining a roughly 71% hit rate across a ~35GB cache window without any custom caching logic.
Performance Breakdown: Two Configurations, One Machine
Flash-MoE ships two quantization profiles, and the tradeoffs are instructive:
| Config | Disk Size | Speed | Quality |
|---|---|---|---|
| 4-bit | 209 GB | 4.36 tok/s | Full quality, tool calling works |
| 2-bit | 120 GB | 5.74 tok/s | Faster, but breaks JSON/tool calling |
The 4-bit configuration is the practical one. It preserves Qwen3.5-397B’s production capabilities — including structured output and function calling — at a speed that’s usable for real tasks. Under warm cache conditions, individual tokens can hit 7.05 tok/s.
A key finding: Qwen3.5 normally activates 10 experts per token, but Flash-MoE prunes this to 4 with no measurable quality loss. That’s a 60% reduction in expert data loaded per token, which directly translates to less SSD bandwidth consumed and faster inference.
Each layer takes about 4.28ms to process at 4-bit resolution, broken down as:
- GPU attention + delta-net computation: 1.22ms
- GPU projections + routing: 0.55ms
- SSD expert loading (4 experts in parallel): 2.41ms
The SSD loading dominates — which makes sense. The bottleneck isn’t compute; it’s how fast you can feed the GPU.
Zero Lines Written by Hand: The Autoresearch Story
Here’s the part that caught Simon Willison’s attention enough to write a detailed blog post on March 18: Dan Woods didn’t write any of the code himself. Not the ~5,000 lines of Objective-C inference engine. Not the ~1,100 lines of Metal shaders. Not the 2-bit requantization pipeline. Not the tests.
He fed Claude Code the Apple paper, some references on the Apple Neural Engine, and Andrej Karpathy’s autoresearch pattern. Claude ran 90 experiments over roughly 8 hours, iterating on optimizations and documenting 58 of them in detail. The project repo includes a full PDF paper — also largely written by Claude — describing the entire optimization journey.
Some of those experiments failed. LZ4 compression of expert weights added 13% overhead instead of saving time. Speculative decoding broke even at best. Temporal expert prediction (guessing which experts the next token would need) was 18% slower. The project documents these dead ends openly, which is arguably more valuable than the successes for anyone trying to replicate or extend the work.
The development pattern here — human provides the insight and reference materials, AI does the implementation and experimentation — is becoming increasingly common. But Flash-MoE is notable for the scope: a complete, working inference engine with hand-tuned GPU kernels, not just a prototype or proof of concept.
Under the Hood: Metal Shaders and Hardware Constraints
The technical depth of Flash-MoE goes well beyond “load weights from SSD.” The Metal compute pipeline includes hand-optimized kernels for:
- 4-bit and 2-bit dequantized matrix-vector multiplication — the core operation for running quantized models on Apple GPUs
- Fused SwiGLU activation — combining the gating and activation in a single kernel pass
- Two-pass RMS normalization — necessary for numerical stability at this scale
- Batched GPU attention with fused RoPE and deinterleaving
- FMA-optimized dequantization — rearranging the math from `(nibble * scale + bias) * x` to `fma(nibble, scale*x, bias*x)` for a 12% speedup through fused multiply-add
One particularly interesting finding: on Apple Silicon, SSD DMA and GPU compute share the memory controller bandwidth. Running them simultaneously causes GPU latency spikes because they’re competing for the same bus. Flash-MoE’s solution is a serial pipeline — GPU compute first, then SSD load, then GPU again — which sounds counterintuitive but turned out to be hardware-optimal.
The entire stack uses zero external dependencies. No Python. No PyTorch. No MLX (despite early experiments with it). Just C, Objective-C, and Metal — compiled and linked directly. The code composition is roughly 59% Objective-C, 14% C, 7% Metal, with some Python utilities and TeX for the paper.
How Does Flash-MoE Compare to Existing Tools?
The local LLM inference ecosystem in 2026 is crowded. llama.cpp, Ollama, vLLM, and LM Studio all serve different niches. Where does Flash-MoE fit?
llama.cpp is the closest comparison. It has native Metal support and is the go-to for Apple Silicon inference. But llama.cpp is a general-purpose engine — it handles dozens of model architectures and quantization formats. Flash-MoE is laser-focused on one thing: streaming MoE expert weights from SSD. For that specific use case, it achieves something llama.cpp currently can’t: running a 397B model on 48GB of RAM.
Ollama wraps llama.cpp with a user-friendly interface and model management. It’s great for models that fit in memory. Flash-MoE solves the “doesn’t fit in memory” problem that Ollama sidesteps by offering smaller models.
vLLM targets server-side deployment with continuous batching and PagedAttention. It’s designed for throughput at scale on NVIDIA GPUs, not single-user inference on a laptop. Different problem entirely.
The honest comparison: Flash-MoE is not a replacement for any of these tools. It’s a specialized engine that proves a specific technical point — frontier-scale MoE models can run on consumer hardware by exploiting sparsity and fast storage. Whether this approach gets adopted into general-purpose frameworks like llama.cpp remains to be seen.
What Qwen3.5-397B Actually Brings to the Table
Running a 397B model locally matters only if the model is actually good. Qwen3.5-397B-A17B activates just 17B parameters per forward pass (out of 397B total), supports a 256K context length and 201 languages, and benchmarks in the same tier as Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2.
Some numbers: 83.6 on LiveCodeBench v6, 91.3 on AIME 2026, and 76.5 on IFBench (beating GPT-5.2’s 75.4 on that particular benchmark). It’s released under Apache 2.0, making it fully open-weight and commercially usable.
Having a model this capable running locally — even at 4.4 tok/s — opens up use cases where latency matters less than privacy and cost. Long-running analysis tasks, document processing, local coding assistants that never send your code to an API endpoint.
Community Reception
Flash-MoE has picked up 679 GitHub stars and 85 forks since launch. Hacker News discussion hit 62 points with active technical debate. The r/LocalLLaMA community — now over 266,000 members — has been particularly engaged, since “running bigger models on less hardware” is essentially the subreddit’s mission statement.
Simon Willison’s March 18 blog post brought additional visibility, focusing on both the technical achievement and the autoresearch development methodology. The combination of “frontier model on a laptop” and “AI wrote all the code” makes for a compelling narrative, regardless of where you stand on either topic.
FAQ
What hardware do I need to run Flash-MoE?
The tested configuration is a MacBook Pro with an M3 Max chip, 48GB unified memory, and a 1TB NVMe SSD. The project requires Apple Silicon — it uses Metal compute shaders and relies on the high SSD bandwidth (17.5 GB/s) available on modern Macs. Intel Macs and non-Apple hardware are not supported.
Is Flash-MoE faster than running Qwen3.5-397B through an API?
No. Cloud inference of Qwen3.5-397B runs at 84.5 tokens per second on optimized infrastructure — roughly 20x faster than Flash-MoE’s 4.36 tok/s. The advantage isn’t speed; it’s privacy, zero per-token cost, and offline availability. If you need fast throughput, use an API. If you need local, private inference of a frontier-class model, Flash-MoE makes that possible.
Can I use Flash-MoE with models other than Qwen3.5-397B?
Currently, no. Flash-MoE is purpose-built for Qwen3.5-397B-A17B’s specific architecture (GatedDeltaNet + MoE with 512 experts). The underlying technique — SSD expert streaming for MoE models — could theoretically apply to other large MoE models, but the codebase would need adaptation for different architectures.
How does the 2-bit vs 4-bit quantization choice affect output quality?
The 4-bit configuration preserves full model quality including structured output and tool calling at 4.36 tok/s. The 2-bit configuration is faster (5.74 tok/s) and reduces disk usage from 209GB to 120GB, but it breaks JSON generation and tool calling. For any production-like use case, 4-bit is the recommended configuration.
Is this a practical daily-driver setup or more of a proof of concept?
Somewhere in between. At 4.4 tok/s, generating a 500-token response takes about two minutes — usable for tasks where you can wait, but not for interactive chat. The bigger significance is proving that the SSD-streaming approach works for MoE models on consumer hardware, which could influence how future inference engines handle models that exceed available RAM.
