Here’s a question I didn’t know I cared about until this week: what’s the absolute smallest transformer model that can add two 10-digit numbers correctly? Not approximately, not most of the time — at least 99% accuracy on a proper test set. That’s the challenge behind [AdderBoard](https://github.com/anadim/AdderBoard), an open-source competitive leaderboard that’s been quietly consuming the attention of ML researchers and tinkerers alike.
The project was kicked off by Dimitris Papailiopoulos as an experiment he called “Addition Under Pressure.” He had Claude Code and Codex each try to build the smallest transformer they could for 10-digit addition. Claude Code came back with a 6,080-parameter model. Codex did it in 1,644. Both respectable, but then the community got involved, and things got wild. The current trained-weights record sits at just 311 parameters (by rezabyt, using rank-3 factorization and a grokking trick), while the hand-coded category has been pushed down to an absurd [36 parameters with 100% accuracy](https://github.com/anadim/AdderBoard). Let that sink in — 36 parameters to perfectly add any two 10-digit numbers.
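If "rank-3 factorization" sounds mysterious, the parameter savings are easy to see: a full m×n weight matrix costs m·n parameters, while replacing it with a product of an m×r and an r×n matrix costs only r·(m+n). Here is a quick sketch of the arithmetic — the dimensions are made up for illustration and are not the actual AdderBoard model shapes:

```python
# Parameter cost of a dense weight matrix vs. a low-rank factorization
# W ~ A @ B, where A is m x r and B is r x n. Shapes are hypothetical,
# chosen only to show the scaling; they are not rezabyt's real config.

def full_params(m: int, n: int) -> int:
    return m * n                # dense matrix: every entry is a parameter

def low_rank_params(m: int, n: int, r: int) -> int:
    return r * (m + n)          # two thin factors instead of one dense matrix

m, n, r = 16, 16, 3
print(full_params(m, n))        # 256 parameters dense
print(low_rank_params(m, n, r)) # 96 parameters at rank 3
```

The win grows with matrix size: at rank 3, the factorized cost scales linearly in m + n rather than quadratically, which is exactly the kind of trick you need when the whole budget is a few hundred parameters.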
What makes this more than just a fun code golf exercise is what it reveals about transformer internals. There's a fascinating "parameter cliff" around 800 parameters: models above it work, models below it mostly don't. Researchers also found that single-layer decoders beat two-layer models at the same parameter budget, which is counterintuitive if you assume more layers always means more expressiveness. The whole thing forces you to think about what a transformer actually *needs* to do addition: digit alignment via attention, per-digit arithmetic via MLPs, and carry propagation via autoregressive generation.
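Those three sub-tasks are easy to state in plain code. The sketch below is not a transformer — it's ordinary Python — but it makes explicit what a model has to learn: line up the digits, add each pair, and thread the carry forward while emitting output digits one at a time, least-significant first, the order an autoregressive decoder handles most naturally:

```python
# A plain-Python sketch of the computation a transformer must learn for
# 10-digit addition. Each comment marks the sub-task a model component
# would have to implement. This is an illustration, not any leaderboard
# entry's actual algorithm.

def add_autoregressive(a: str, b: str) -> str:
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)    # digit alignment (attention's job)
    carry, out = 0, []
    for i in range(width - 1, -1, -1):       # emit LSB-first, one digit per step
        s = int(a[i]) + int(b[i]) + carry    # per-digit arithmetic (the MLP's job)
        out.append(str(s % 10))
        carry = s // 10                      # carry propagation (autoregression's job)
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))

print(add_autoregressive("9999999999", "1"))  # "10000000000"
```

Seen this way, the hard part for a tiny model isn't the arithmetic table (only 100 digit pairs exist); it's wiring attention so each output position reads exactly the right pair of input digits.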
The project [hit Hacker News today](https://news.ycombinator.com/item?id=47170030) and there’s active debate about whether training a transformer for basic arithmetic is even meaningful. Fair point, but I think that misses what makes AdderBoard compelling. It’s not about building a practical calculator. It’s about understanding the minimal computational structure required for a specific task — and that’s a question with real implications for model compression and efficiency research. Ziming Liu even wrote up a [deep-dive on 181-parameter models](https://kindxiaoming.github.io/blog/2026/digit-addition/) that’s worth reading if you want the math behind it.
If you enjoy competitive optimization puzzles or want a hands-on way to understand how transformers actually work under the hood, this is a great rabbit hole to fall into.