Tiny-vLLM is a fully functional LLM inference engine in C++ and CUDA — written deliberately small, with every kernel and every line of math walked through in a free companion course. It’s the “younger and smaller sibling” of vLLM, built for understanding rather than deployment.
## What’s actually in there
The implementation is not a toy. It includes a FlashAttention-like softmax, PagedAttention, and both static and continuous batching — the load-bearing pieces of any serious inference engine. The repo loads a real Llama 3.2 1B Instruct model from Safetensors and runs a full forward pass on GPU. That’s a long way from a CPU-only educational stub: you can actually see modern inference work end-to-end.
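To make the "FlashAttention-like softmax" idea concrete, here is a minimal CPU-side C++ sketch of the streaming (online) softmax trick for a single query. It is not taken from the Tiny-vLLM repository: the function name `streaming_attention`, the argument layout, and the use of plain `std::vector` are assumptions made for illustration, and the real engine does this work in CUDA kernels. The sketch only shows the core idea: keep a running maximum and a running sum, and rescale the partial output whenever the maximum grows, so the full row of attention weights never has to be stored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper for illustration; not the Tiny-vLLM API.
// Computes sum_i softmax(scores)[i] * values[i] in one pass over the inputs.
// scores[i] is assumed to be the already-scaled, finite dot product q.k_i,
// and values[i] is the matching value row v_i.
std::vector<float> streaming_attention(
    const std::vector<float>& scores,
    const std::vector<std::vector<float>>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    const std::size_t dim = values[0].size();

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;          // sum of exp(score - running_max) so far
    std::vector<float> acc(dim, 0.0f); // unnormalized weighted sum of values

    for (std::size_t i = 0; i < scores.size(); ++i) {
        assert(values[i].size() == dim);
        const float new_max = std::max(running_max, scores[i]);
        // Factor that re-expresses earlier terms relative to the new maximum.
        const float correction = std::exp(running_max - new_max);
        const float weight = std::exp(scores[i] - new_max);

        running_sum = running_sum * correction + weight;
        for (std::size_t d = 0; d < dim; ++d) {
            acc[d] = acc[d] * correction + weight * values[i][d];
        }
        running_max = new_max;
    }

    // Normalize once at the end instead of materializing the softmax row.
    for (std::size_t d = 0; d < dim; ++d) {
        acc[d] /= running_sum;
    }
    return acc;
}
```

Because every exponent is taken relative to the running maximum, large scores such as 1000 do not overflow. A tiled GPU kernel applies the same update per block of keys rather than per key, which is where the memory savings come from.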
## A learning artifact, not a deployment
Production vLLM has hundreds of contributors and a large feature surface. Tiny-vLLM stays narrow on purpose so the code reads top-to-bottom. The author, Jędrzej Maczan, has paired the engine with a course explaining every decision, and plans follow-ups on ML compilers and alternative attention mechanisms if interest holds.
## Why it matters
LLM inference has quietly become the part of the stack where the knowledge gap is widest: many engineers ship models, but very few understand the kernels underneath. A readable C++/CUDA reference that implements the real techniques, rather than pseudocode, is the kind of thing the field needs more of. It’s also a quiet recruiting filter: engineers who can ship something this complete tend to be the engineers you want.
