Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


OpenGraviton Just Let Me Run a 140B Model on My Mac Mini — Here’s How

I’ve been chasing the dream of running truly massive language models locally for a while now. Not the 7B or 13B stuff — I mean the big ones, the 100B+ parameter beasts that usually demand a rack of A100s. So when [OpenGraviton](https://opengraviton.github.io) popped up on [Hacker News Show HN](https://news.ycombinator.com) and got picked up by bestofshowhn.com, I had to try it.

The pitch sounds almost too good: run 500B+ parameter models on consumer hardware. A Mac Mini. Your laptop. No cloud bills, no NVIDIA tax. And honestly? It mostly delivers. I tested it on an M1 Max with 64GB of RAM, loading up a 140B parameter model that would normally eat 280GB of memory at 16-bit precision. OpenGraviton’s ternary quantization crunched it down to about 35GB — small enough to actually fit. The trick is their 1.58-bit approach, where each weight collapses to one of just three values: {-1, 0, +1}. That’s roughly a 10x compression ratio versus 16-bit weights, which is wild.
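To make the 1.58-bit idea concrete, here’s a minimal NumPy sketch of absmean-style ternary quantization (the scheme popularized by BitNet b1.58). This is a generic illustration, not OpenGraviton’s actual code; the function names are mine.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Collapse weights to {-1, 0, +1} with a single per-tensor scale."""
    scale = np.abs(w).mean()  # absmean scale factor
    q = np.clip(np.round(w / (scale + 1e-8)), -1, 1).astype(np.int8)
    return q, scale

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the ternary codes."""
    return q.astype(np.float32) * scale

# A fake 16-bit weight matrix: small random values, like trained weights.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

q, s = ternary_quantize(w)
w_hat = ternary_dequantize(q, s)
```

Each entry of `q` needs only log2(3) ≈ 1.58 bits when packed, versus 16 bits for the original float — which is where the ~10x figure comes from.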

But quantization alone isn’t the whole story. The engine also does dynamic sparsity pruning, skipping over 70% of computations per token through Top-K zeroing and MoE routing. Then there’s the layer streaming via mmap — it pulls model layers directly from your NVMe SSD, so you’re not bottlenecked by RAM in the traditional sense. Stack speculative decoding on top of that and you get roughly 2-3x faster generation than you’d expect.
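The Top-K zeroing piece is simple enough to sketch in a few lines: keep only the largest-magnitude activations and zero the rest, so the downstream multiply-adds for the zeroed entries can be skipped. Again, this is a generic sketch of the technique, not OpenGraviton’s implementation; keeping ~30% of entries corresponds to skipping ~70% of the work.

```python
import numpy as np

def topk_sparsify(x: np.ndarray, keep_frac: float = 0.3) -> np.ndarray:
    """Zero all but the top `keep_frac` fraction of entries by magnitude."""
    k = max(1, int(x.size * keep_frac))
    # k-th largest absolute value becomes the cutoff threshold
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

x = np.random.default_rng(1).normal(size=4096).astype(np.float32)
sparse = topk_sparsify(x, keep_frac=0.3)
```

In a real engine the zeros aren’t just stored — the kernel skips those rows/columns entirely, which is where the per-token compute savings come from.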

Setting it up from the [GitHub repo](https://github.com/opengraviton/graviton) is straightforward. Clone, run the hardware check with `python3 -m graviton.cli.main info`, and you’re off. It supports both macOS and Linux, ships under Apache 2.0, and handles models like Mixtral-8x22B out of the box.
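If the mmap-based layer streaming mentioned above sounds abstract, the core idea fits in a few lines of plain Python: map a weight file into your address space and let the OS page it in from the SSD on demand, instead of reading the whole thing into RAM up front. A toy demo (my own, not OpenGraviton code):

```python
import mmap
import os
import tempfile
import numpy as np

# Write a fake "layer" of weights to disk.
layer = np.arange(1024, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "layer0.bin")
layer.tofile(path)

# Map the file instead of loading it; pages are faulted in only when touched.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# frombuffer creates a zero-copy view over the mapped pages.
view = np.frombuffer(mm, dtype=np.float32)
```

Scale that up to per-layer weight shards and you get inference where the working set lives on NVMe and RAM only holds the layers currently in flight.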

Is there a catch? Sure. Quality takes a hit at these extreme quantization levels — you’ll notice some degradation compared to full-precision inference, especially on nuanced reasoning tasks. And throughput on the M1 Max isn’t going to match a proper GPU cluster. But for local experimentation, privacy-sensitive workloads, or just the sheer satisfaction of running massive models without a cloud account, OpenGraviton is the most impressive thing I’ve seen in a while. The local AI inference space just got a lot more interesting.

