Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


Karpathy Built a Full ChatGPT Clone in 8,000 Lines of Code — Nanochat Hits 47K Stars

Andrej Karpathy has a habit of making complex things feel approachable. His nanoGPT project showed developers how pretraining works. His YouTube lectures became unofficial grad school for thousands. Now, with Nanochat, he’s taken the next logical step: a complete, end-to-end ChatGPT pipeline — tokenization, pretraining, finetuning, RLHF, inference, and a web UI — that you can run on a single cloud GPU node for about $100.

The repo has already cleared 47,000 GitHub stars and is sitting at the top of GitHub Trending. But what makes Nanochat interesting isn’t just the star count. It’s the argument embedded in the project itself: that the entire ChatGPT stack, from raw text to a working chatbot, can be compressed into roughly 8,000 lines of readable, hackable Python.

From nanoGPT to Nanochat: What Actually Changed

Karpathy’s earlier project, nanoGPT, was a pretraining-only affair. You could train a small GPT-2-class model, but that was where it stopped — no instruction following, no chat interface, no alignment step. It was a teaching tool for one slice of the LLM pipeline.

Nanochat covers the full stack. The training pipeline runs through six distinct stages:

  1. Tokenization — A custom BPE tokenizer built in Rust with a 65,536-token vocabulary
  2. Pretraining — Base model training on FineWeb-EDU data
  3. Midtraining — Adaptation to conversations using SmolTalk, plus explicit training on multiple-choice tasks (MMLU) and tool use via Python code execution blocks
  4. Supervised Fine-Tuning (SFT) — High-quality conversational examples with safety-oriented refusal behavior
  5. Reinforcement Learning — GRPO on GSM8K for math reasoning
  6. Inference & UI — CLI and web-based chat with KV cache-optimized generation
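To ground stage 1: a BPE tokenizer builds its vocabulary by repeatedly finding the most frequent adjacent pair of token ids and merging it into a new id, starting from raw bytes. Here is a toy Python version of a single merge step — nanochat's real tokenizer is written in Rust and far more optimized, so treat this only as an illustration of the algorithm:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most common one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")       # start from raw bytes (ids 0-255)
pair = most_frequent_pair(ids)   # (97, 97), i.e. "aa"
ids = merge(ids, pair, 256)      # 256 is the first id beyond the byte range
```

Repeating this loop until the vocabulary reaches 65,536 entries is, conceptually, all stage 1 does.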

The architecture revolves around a single “complexity dial”: the --depth parameter. Set the number of transformer layers, and everything else — width, attention heads, learning rates, weight decay, training horizon — gets calculated automatically to produce a compute-optimal model. GPT-2-level capability lands around depth 24-26, which translates to roughly 560 million parameters.
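As a rough illustration of how one depth setting can pin down the rest of the model shape, the sketch below uses generic transformer conventions — width as a fixed multiple of depth, the standard ~12·L·d² weight estimate, and untied embeddings. These are common rules of thumb, not necessarily nanochat's exact scaling formulas; note that under these assumptions a depth of 20 already lands near the 560M mark:

```python
def model_shape(depth, head_dim=64, vocab=65_536):
    """Derive width and an approximate parameter count from depth alone.
    Assumes width = depth * head_dim and the standard ~12 * L * d^2
    estimate for attention + MLP weights, plus untied input/output
    embeddings. Generic conventions, not nanochat's exact rules."""
    width = depth * head_dim
    block_params = 12 * depth * width**2   # attention + MLP matrices
    embed_params = 2 * vocab * width       # token embedding + unembedding
    return width, block_params + embed_params

width, params = model_shape(20)
print(width, f"{params/1e6:.0f}M")  # → 1280 561M
```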

The $100 Question: What Do You Actually Get?

The headline claim is that you can train a GPT-2-class chatbot for under $100. Here’s how the math breaks down:

An 8×H100 GPU node costs about $24/hour. The full training run takes approximately 2-3 hours. That puts the total somewhere between $48 and $72 at on-demand rates. On spot instances, costs drop to around $15.
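The arithmetic is simple enough to sanity-check in a few lines (rates taken from the figures above and assumed constant for the whole run):

```python
def training_cost(hours, rate_per_hour=24.0):
    """Estimate the cloud bill for one run: node-hours times hourly rate."""
    return hours * rate_per_hour

# 8xH100 at ~$24/hr, 2-3 hour run
low, high = training_cost(2), training_cost(3)
print(f"${low:.0f}-${high:.0f}")  # → $48-$72
```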

For context, OpenAI spent an estimated $43,000 training GPT-2 in 2019. Seven years later, Nanochat reproduces that capability for less than the cost of a nice dinner.

The resulting model isn’t going to compete with GPT-4 or Claude. It’s a GPT-2-class system — useful for understanding how the pipeline works, not for replacing your production chatbot. But that’s the point. Nanochat isn’t trying to be a frontier model. It’s trying to be the best teaching tool for understanding how frontier models are built.

Performance-wise, the latest leaderboard entry (March 2026) shows a wall-clock training time of 1.80 hours with a CORE score of 0.2690, achieved through optimizations discovered by Karpathy’s companion project, autoresearch — where AI agents autonomously experiment with training configurations overnight.

What the Community Is Saying

The Hacker News thread on Nanochat drew significant discussion, and the reactions split along predictable lines.

The praise: Developers called it “the capstone project of LLM101n” — Karpathy’s long-running educational series. Multiple users reported successfully completing full training runs and uploading models to Hugging Face. Others built CPU-compatible inference scripts for macOS.

The technical debates: The adoption of the Muon optimizer (from Bernstein, with Keller Jordan’s implementation) over AdamW drew attention. Muon reduces memory requirements for the matrix multiplication parameters, while AdamW handles embeddings and unembeddings. Loss is reported in bits-per-byte rather than per-token metrics, making comparisons tokenizer-invariant — a small but meaningful design decision.
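The bits-per-byte idea is worth spelling out: a model with a bigger vocabulary sees fewer tokens for the same text, so per-token loss isn't comparable across tokenizers, while normalizing by the raw byte count is. A minimal sketch of the conversion (function name and signature are illustrative, not nanochat's actual code):

```python
import math

def bits_per_byte(mean_nats_per_token, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte.
    Dividing by the byte count of the raw text makes the metric
    independent of the tokenizer's vocabulary size."""
    total_bits = mean_nats_per_token * total_tokens / math.log(2)
    return total_bits / total_bytes
```

For example, a loss of ln(2) nats per token on text where each token covers exactly one byte works out to 1.0 bits per byte.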

The meta-discussion: Perhaps the most talked-about detail was Karpathy’s own admission that AI coding assistants were “not helpful at all” for writing Nanochat. He described the code as “basically entirely hand-written with tab autocomplete,” noting that Claude and Codex agents were “too far off the data distribution” to contribute meaningfully. This sparked a broader debate about where AI coding tools actually work (CRUD apps and boilerplate) versus where they don’t (novel, non-standard codebases).

One commenter summed it up: experienced developers found the most value using AI tools for UI scaffolding while maintaining full control over algorithmic cores — a role-specialization model rather than full automation.

Nanochat vs. the Alternatives: Where Does It Fit?

Nanochat occupies a specific niche. It’s not the only way to train a small language model, but it makes a particular set of tradeoffs:

|                 | Nanochat                              | Hugging Face TRL       | LitGPT                   | Axolotl           |
|-----------------|---------------------------------------|------------------------|--------------------------|-------------------|
| Scope           | Full pipeline (tokenizer → RLHF → UI) | Finetuning & alignment | Pretraining & finetuning | Finetuning focused |
| Lines of code   | ~8,000                                | Large framework        | Medium                   | Medium            |
| Target user     | Learners, researchers                 | Production teams       | Researchers              | Practitioners     |
| Configurability | Minimal (by design)                   | Extensive              | Moderate                 | Extensive         |
| Minimum cost    | ~$15 (spot)                           | Varies                 | Varies                   | Varies            |

The key differentiator is intentional simplicity. Tools like TRL or Axolotl are designed for production flexibility. Nanochat is designed so you can read and understand the entire codebase in an afternoon. There are no abstractions hiding complexity — every stage of the pipeline is visible and modifiable.

This makes it ideal for a specific audience: ML engineers who want to understand the full LLM stack, students working through Karpathy’s educational content, and researchers who want a clean baseline to experiment from.

The Autoresearch Connection

In early March 2026, Karpathy released a companion project called autoresearch — a 630-line Python tool that gives an AI agent access to the Nanochat training setup and lets it run autonomous optimization experiments. The agent modifies code, trains for five minutes, checks if results improved, and repeats.
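The loop described above is essentially greedy hill-climbing. A purely illustrative sketch — not the actual 630-line tool, and the callables here are placeholders for the agent and the training job:

```python
def autoresearch_loop(baseline_score, propose_edit, train_and_eval, rounds=8):
    """Greedy hill-climb sketch of the autoresearch idea: propose a change,
    run a short training job, keep the change only if the metric improves.
    `propose_edit` and `train_and_eval` stand in for the agent and the
    five-minute training run."""
    best, accepted = baseline_score, []
    for _ in range(rounds):
        edit = propose_edit()          # agent proposes a config/code change
        score = train_and_eval(edit)   # short training run + evaluation
        if score > best:               # keep only strict improvements
            best, accepted = score, accepted + [edit]
    return best, accepted
```

Multi-agent parallelism, as hinted at for the next round, would amount to running several such loops at once and merging the winning edits.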

The results were notable. Autoresearch reduced Nanochat's time-to-GPT-2 benchmark by 11%. Shopify CEO Tobi Lütke adapted the framework internally and reported a 19% improvement in validation scores, with the agent-optimized smaller model eventually outperforming a manually configured larger one.

Karpathy has signaled plans for multi-agent parallelism in the next round, arguing that frontier labs could generalize this approach to optimize proxy metrics on small models before promoting winning ideas to larger scales.

FAQ

How much does it actually cost to train a Nanochat model?
The baseline cost is approximately $48 for a GPT-2-capability model on an 8×H100 node at on-demand rates. Using spot instances, the cost can drop to around $15. The full training run takes 2-3 hours.

Can I run Nanochat without H100 GPUs?
Yes. The codebase supports 8×A100 nodes, single GPUs, CPU-only training, and Apple Silicon (MPS). Training will take longer on less powerful hardware, but the pipeline works across all these configurations.

Is the resulting model good enough for production use?
No. Nanochat produces a GPT-2-class model — roughly 560M parameters. It’s designed for education and experimentation, not production deployment. For production chatbots, you’d want a much larger model with more extensive training data and alignment work.

How does Nanochat compare to nanoGPT?
nanoGPT covered only pretraining. Nanochat covers the entire LLM pipeline: tokenization, pretraining, midtraining, SFT, reinforcement learning, and inference with a chat UI. It’s the difference between learning one stage versus understanding the full system.

What programming languages and frameworks does Nanochat use?
Primarily Python with PyTorch, plus a small amount of Rust for the BPE tokenizer (built via Maturin). The codebase also includes Jupyter notebooks for exploration, a web-based chat UI, and optional Weights & Biases integration for experiment tracking.

