Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more, then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


ntransformer Just Made Running Llama 70B on a Single RTX 3090 Actually Possible

There’s something deeply satisfying about watching a 70-billion-parameter model run on hardware that was never supposed to handle it. That’s exactly what [ntransformer](https://github.com/xaskasdf/ntransformer) pulls off — Llama 3.1 70B on a single RTX 3090 with just 24GB of VRAM. No multi-GPU rigs, no cloud rentals, no $10k server builds. Just your gaming card and a clever piece of engineering.
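To see why that's hard, a quick back-of-envelope check helps. The bits-per-weight figure below is my rough estimate for Q6_K-style quantization, not a number from the repo:

```python
# Back-of-envelope: can a quantized 70B model fit in 24 GB of VRAM?
params = 70e9
bits_per_weight = 6.56   # rough average for Q6_K-style quantization (assumed)

model_gb = params * bits_per_weight / 8 / 1e9
print(f"Quantized weights: ~{model_gb:.0f} GB vs 24 GB of VRAM")
# Far too big to keep resident, hence streaming weights from the SSD.
```

Roughly 57 GB of weights against 24 GB of VRAM: the model can never be fully resident, so every token has to pull most of its weights from somewhere slower.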

The trick is deceptively simple in concept but brutal in execution. Instead of the traditional path, where data hops from NVMe to CPU to system RAM and finally to the GPU, ntransformer cuts out the middlemen. It uses a userspace NVMe driver to shuttle model weights directly from your SSD into pinned GPU-accessible memory via DMA. The data path looks like this: NVMe SSD straight to pinned staging buffers, then over PCIe to GPU buffers, then compute. The CPU barely has to lift a finger.
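In spirit, the hot loop is a double-buffered staging pipeline: while one chunk is in flight to the GPU, the next is being pulled off the SSD. Here's a toy Python sketch of the idea; plain byte copies stand in for real DMA and PCIe transfers, and the buffer count and chunk size are illustrative, not taken from the codebase:

```python
CHUNK = 4  # bytes per "DMA" chunk (illustrative; real chunks are megabytes)

def stream_weights(nvme_blob: bytes, chunk_size: int = CHUNK) -> bytes:
    """Simulate the direct NVMe -> pinned staging -> GPU path.

    Two staging buffers alternate: while one chunk is being "uploaded"
    over PCIe, the next can be "DMA'd" in from the SSD. Here both steps
    are ordinary byte copies, so this only shows the data flow shape.
    """
    gpu_buffer = bytearray()
    staging = [bytearray(chunk_size), bytearray(chunk_size)]
    for i in range(0, len(nvme_blob), chunk_size):
        buf = staging[(i // chunk_size) % 2]   # alternate staging buffers
        chunk = nvme_blob[i:i + chunk_size]
        buf[:len(chunk)] = chunk               # "DMA" from NVMe into pinned RAM
        gpu_buffer += buf[:len(chunk)]         # "PCIe" upload into GPU memory
    return bytes(gpu_buffer)
```

The point of the pinned staging buffers is that the SSD controller and the GPU can both address them directly, so weight bytes never detour through pageable system memory or the CPU caches.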

The project hit [GitHub Trending](https://github.com/xaskasdf/ntransformer) today and sparked a lively thread on [Hacker News](https://news.ycombinator.com/item?id=47104667) with 86 points and 22 comments. Folks there were quick to point out that the current speed — around 0.2 tokens per second for 70B Q6_K — isn’t exactly conversational. And honestly, they’re right. You won’t be having snappy back-and-forth chats at that speed. But the creator claims an 83x speedup over naive mmap baselines, and with the layer-skip optimization (which drops redundant layers using cosine similarity), they’ve pushed it to 0.5 tok/s for Q4_K_M. Not blazing fast, but remember — this is a 70B model on a single consumer GPU. The fact it works at all is remarkable.
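The layer-skip idea itself is easy to sketch: compare each layer's input and output hidden states, and treat layers that barely rotate the vector as redundant. The function names and the 0.99 threshold below are my own illustration, not ntransformer's actual code:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def redundant_layers(hidden_states, threshold=0.99):
    """Flag layers whose output barely changes the hidden state.

    hidden_states[i] is the activation vector after layer i. A layer whose
    input and output are nearly parallel (cosine ~ 1) contributes little
    to the residual stream and is a candidate to skip at inference time.
    """
    skips = []
    for i in range(1, len(hidden_states)):
        if cosine_similarity(hidden_states[i - 1], hidden_states[i]) >= threshold:
            skips.append(i)
    return skips
```

Every skipped layer is a layer's worth of weights that never has to cross the PCIe bus, which is why this optimization moves the needle from 0.2 to 0.5 tok/s: the bottleneck here is bandwidth, not compute.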

What I find most interesting is the 3-tier adaptive caching system. ntransformer automatically figures out how to distribute model layers across VRAM, pinned system RAM, and NVMe storage based on what your hardware actually has available. It’s not just brute-forcing everything through the SSD path. If layers fit in VRAM, they stay resident at full speed (the 8B model hits nearly 49 tok/s in resident mode). It only falls back to the NVMe pipeline when it absolutely has to.
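A greedy version of that placement decision is simple to picture. This is my own minimal sketch, assuming per-layer sizes and free-capacity numbers as inputs; the real scheduler presumably weighs more than raw fit:

```python
def place_layers(layer_sizes, vram_free, pinned_free):
    """Greedy 3-tier assignment: VRAM first, then pinned RAM, then NVMe.

    Layers that fit in VRAM stay resident at full speed; the rest spill
    to pinned system RAM, and only the overflow falls back to the slow
    NVMe streaming path.
    """
    placement = []
    for size in layer_sizes:
        if size <= vram_free:
            placement.append("vram")
            vram_free -= size
        elif size <= pinned_free:
            placement.append("pinned")
            pinned_free -= size
        else:
            placement.append("nvme")
    return placement
```

With an 8B model, every layer lands in the "vram" tier and you get resident-mode speed; with 70B on a 3090, most layers spill to the lower tiers, which is where the NVMe pipeline earns its keep.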

Fair warning though: this isn’t a plug-and-play affair. You’re dealing with low-level PCIe operations, VFIO passthrough, and raw NVMe block writes. The README explicitly warns about potential NVMe link failures and data loss if misconfigured. Don’t run this on your boot drive. It requires Linux with kernel 6.17+, CUDA Toolkit 13.1, and some GRUB/DKMS patching. This is enthusiast-grade stuff.

Still, the implications are huge. If the community can push those token rates up — and given how fast open-source AI tooling moves, I wouldn’t bet against it — this approach could make 70B-class models accessible to anyone with a decent gaming rig and a spare NVMe drive. Keep an eye on this one.

