Training a 100-billion-parameter model usually means a cluster of expensive GPUs. MegaTrain flips the script: store everything in CPU memory, and treat the GPU as a temporary math worker.
How It Works
The core idea is dead simple. Parameters and optimizer states live in host RAM. During forward and backward passes, MegaTrain streams weights to the GPU layer by layer through a double-buffered pipeline — one layer computes while the next one loads. The GPU never holds the full model. Once a layer finishes, its memory is freed immediately.
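The double-buffered pipeline can be sketched in a few lines. This is a hypothetical illustration, not MegaTrain's actual code: plain Python lists stand in for weights, `load_layer` stands in for an async host-to-GPU copy, and a one-worker thread pool plays the role of the prefetch stream that loads layer i+1 while layer i computes.

```python
# Minimal sketch of double-buffered layer streaming (illustrative only).
from concurrent.futures import ThreadPoolExecutor

def load_layer(weights):
    # Stand-in for an async host->device copy (e.g. cudaMemcpyAsync).
    return list(weights)

def compute_layer(activations, device_weights):
    # Stand-in for the layer's forward pass.
    return [a + w for a, w in zip(activations, device_weights)]

def streamed_forward(host_layers, activations):
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        # Prime the pipeline: start loading layer 0.
        pending = prefetcher.submit(load_layer, host_layers[0])
        for i in range(len(host_layers)):
            device_weights = pending.result()   # wait for the copy
            if i + 1 < len(host_layers):
                # Kick off the next copy before computing this layer,
                # so transfer and compute overlap.
                pending = prefetcher.submit(load_layer, host_layers[i + 1])
            activations = compute_layer(activations, device_weights)
            # device_weights goes out of scope here: the "freed
            # immediately" step from the text.
    return activations

# Three tiny "layers", each adding a constant to the activations.
out = streamed_forward([[1, 1], [2, 2], [3, 3]], [0, 0])
print(out)  # [6, 6]
```

At any moment only two layers' weights exist on the "device": the one computing and the one in flight, which is the whole trick.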
On a single NVIDIA GH200 with 1.5TB host memory, MegaTrain hit 1.84x the throughput of DeepSpeed ZeRO-3 for a 14B model. It supports any HuggingFace decoder-only transformer out of the box — Llama, Qwen, Mistral, DeepSeek, you name it.
The Catch
The Hacker News crowd (239 points, 44 comments) was quick to point out: “single GPU” sounds scrappy until you realize the test rig has 1.5TB of RAM. That’s not your gaming PC. And at 341 tokens per second on a 14B model, full pretraining from scratch would take a geological amount of time.
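"Geological" is only a mild exaggeration. A back-of-the-envelope check, assuming a 1-trillion-token training budget (a common figure for modern 14B models; the budget is our assumption, not from the benchmark):

```python
# Pretraining wall-clock time at the reported throughput.
tokens = 1e12            # assumed training budget
tps = 341                # reported tokens/second
seconds = tokens / tps
years = seconds / (365 * 24 * 3600)
print(f"{years:.0f} years")  # roughly 93 years
```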
The real value here is fine-tuning, not pretraining. If you have a beefy workstation with a lot of RAM but only one GPU, MegaTrain lets you fine-tune models that would otherwise require multi-GPU setups. That’s a genuine cost saver.
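To see why host RAM is the binding resource, consider the standard rule of thumb that mixed-precision Adam training needs about 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments). This is a generic estimate, not a MegaTrain-specific figure:

```python
# Memory footprint of full fine-tuning a 14B model with Adam.
params = 14e9
bytes_per_param = 16     # 2 + 2 + 4 + 4 + 4, the usual mixed-precision tally
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # 224 GB
```

224 GB won't fit in any single GPU's VRAM, but it is comfortable inside 1.5TB of host memory, which is exactly the regime MegaTrain targets.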