James O’Beirne dumped everything he knows about running LLMs locally into one repo. It hit 381 points on the Hacker News front page, and it’s not a framework or an app. It’s a reference: one giant hardware × model cheat sheet for self-hosting in 2026, plus ready-to-run vLLM configs.
What you actually get
A price-to-tokens table you can act on. A single 3090 or 4090 (~$2–3K) runs Qwen 3.6-27B at 68–80 tok/s. Dual 3090s or a 128GB M-series MacBook pushes past 150 tok/s. The flex tier: four RTX 6000 Pro Blackwells (384GB VRAM total, ~$46K) running a quantized, expert-pruned GLM-5.2 with 240K+ context at ~80 tok/s, which O’Beirne calls “close to Claude Opus.” He even wired in PCIe 4.0 switches so the GPUs talk to each other directly during tensor parallelism.
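For a feel of what a tensor-parallel vLLM launch looks like, here is a minimal sketch that only builds and prints a `vllm serve` command. It is not taken from O’Beirne’s repo: the model ID `your-org/your-model` is a placeholder, the helper `vllm_serve_command` is hypothetical, and the values (4 GPUs, 240,000-token context, 0.90 memory utilization) are assumptions loosely mirroring the four-GPU tier described above.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size, max_model_len,
                       gpu_memory_utilization=0.90):
    """Build an argv list for `vllm serve` (hypothetical helper, not from the repo)."""
    return [
        "vllm", "serve", model,
        # Shard the model's weights across this many GPUs.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Upper bound on context length (prompt plus generated tokens).
        "--max-model-len", str(max_model_len),
        # Fraction of each GPU's memory vLLM is allowed to claim.
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Placeholder model ID; substitute whatever checkpoint you actually serve.
cmd = vllm_serve_command("your-org/your-model",
                         tensor_parallel_size=4,
                         max_model_len=240_000)
print(shlex.join(cmd))
```

Building the command as a list keeps quoting safe if you later hand it to `subprocess.run`. Whether a given model fits at that context length depends on its quantization and your VRAM, so treat the numbers as starting points.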
The honest catch
The top HN comment does the math nobody wants to hear: $40K in silicon plus power and upkeep dwarfs a $200/month subscription. Local inference isn’t cheaper. You’re paying for privacy and control, full stop. That framing is why this guide is worth bookmarking — it tells you exactly what your money buys before you spend it.
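The break-even arithmetic behind that comment is easy to sketch. The snippet below uses the $40K and $200/month figures quoted above. The function `break_even_months` and its `running_usd_per_month` parameter (power and upkeep) are illustrative assumptions, since the post gives no running-cost numbers.

```python
def break_even_months(hardware_usd, subscription_usd_per_month,
                      running_usd_per_month=0.0):
    """Months until the hardware cost is offset by skipped subscription fees.

    Returns None when monthly running costs (an assumed figure for power
    and upkeep) meet or exceed the subscription, so break-even never comes.
    """
    net_saving = subscription_usd_per_month - running_usd_per_month
    if net_saving <= 0:
        return None
    return hardware_usd / net_saving


# Figures from the post, ignoring power and upkeep entirely.
months = break_even_months(40_000, 200)
print(f"{months:.0f} months (~{months / 12:.1f} years)")
# prints "200 months (~16.7 years)"
```

Even with running costs set to zero, the rig takes roughly 16.7 years to pay for itself against one $200/month plan. Any nonzero power bill pushes that further out, which is the commenter's point.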