If you’ve ever tried building a voice agent, you know the pain. The user says something, there’s an awkward pause while your pipeline processes speech-to-text, then calls an LLM, then converts the response back to audio — and by the time the agent starts talking, the whole thing feels like a bad phone call with someone on the other side of the planet. That’s exactly the problem [Shuo](https://github.com/NickTikhonov/shuo) tackles, and it does so in a surprisingly lean package.
Built by Nick Tikhonov, Shuo is an open-source voice agent orchestration framework written in roughly 600 lines of Python. The headline claim is sub-500ms end-to-end latency — about twice as fast as platforms like Vapi. And having dug through both the [codebase](https://github.com/NickTikhonov/shuo) and Nick’s [detailed write-up](https://www.ntik.me/posts/voice-agent) on how he built it, the numbers actually hold up.
The trick isn’t any single magic component. Shuo stitches together Deepgram Flux for streaming speech recognition, Groq’s llama-3.3-70b for inference (pulling off around 80ms time-to-first-token), and ElevenLabs for text-to-speech. What makes it fast is how these pieces connect. Instead of running STT, then LLM, then TTS as separate sequential steps, Shuo treats the entire pipeline as a continuous stream. The moment Deepgram starts producing a transcript, it’s already feeding into the LLM. The first token out of the LLM immediately hits the TTS WebSocket. Audio packets go straight to the caller via Twilio. There’s no waiting around.
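The chained-streams idea is easy to sketch with async generators. This is not Shuo's actual code — the stage names and payloads below are illustrative stand-ins for the Deepgram, Groq, and ElevenLabs streams — but it shows the key property: each stage consumes its upstream chunk-by-chunk instead of waiting for a complete result.

```python
import asyncio

async def stt_stream():
    # Stand-in for partial transcripts arriving over the STT socket.
    for word in ["hello", "there", "agent"]:
        await asyncio.sleep(0)  # yield control, as a real socket read would
        yield word

async def llm_stream(transcripts):
    # Tokens start flowing as soon as the first transcript chunk arrives.
    async for t in transcripts:
        yield f"token({t})"

async def tts_stream(tokens):
    # Each token is converted to an audio chunk immediately.
    async for tok in tokens:
        yield f"audio[{tok}]"

async def main():
    chunks = []
    # The three stages run as one continuous stream, end to end.
    async for audio in tts_stream(llm_stream(stt_stream())):
        chunks.append(audio)  # in the real pipeline: straight to Twilio
    return chunks

print(asyncio.run(main()))
```

No stage blocks on the one before it finishing, which is why the first audio packet can leave while the user's sentence is still being transcribed.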
The interrupt handling is aggressive too — when a user starts talking mid-response, the system kills the LLM generation, tears down the TTS connection, and flushes Twilio’s audio buffer all at once. It feels responsive in a way that most voice agents simply don’t.
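A minimal sketch of that barge-in logic, assuming the response runs as a cancellable task and queued audio sits in a buffer (the `respond` coroutine and the buffer here are hypothetical, but Twilio's Media Streams protocol does support clearing buffered audio):

```python
import asyncio

async def respond(buffer):
    # Stand-in for the LLM -> TTS -> Twilio response stream.
    for chunk in ["Hel", "lo, ", "how ", "can ", "I ..."]:
        buffer.append(chunk)
        await asyncio.sleep(0.05)  # pacing of outbound audio packets

async def main():
    buffer = []
    task = asyncio.create_task(respond(buffer))
    await asyncio.sleep(0.12)  # user starts talking mid-response
    task.cancel()              # kill the in-flight generation
    try:
        await task
    except asyncio.CancelledError:
        pass                   # cancellation is the expected path here
    buffer.clear()             # flush queued audio so nothing stale plays
    return buffer

print(asyncio.run(main()))
```

The cancel and the flush happen together, so the agent goes silent almost immediately instead of finishing a sentence the user has already talked over.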
Geography turned out to be a huge factor. Nick found that deploying the orchestration server to Railway EU, close to the API providers, cut total latency nearly in half compared to running things locally. Connection pooling for TTS saved another 300ms per response. These are the kinds of practical insights you only get from actually building the thing.
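The connection-pooling win comes from paying the handshake cost once instead of once per response. A toy version of the idea, with a hypothetical `TTSConnection` standing in for a real TLS + WebSocket handshake:

```python
import time

class TTSConnection:
    def __init__(self):
        time.sleep(0.01)  # stand-in for TLS + WebSocket handshake latency

class TTSPool:
    """Keep one warm connection around between responses."""
    def __init__(self):
        self._warm = None

    def acquire(self):
        conn = self._warm or TTSConnection()
        self._warm = None
        return conn

    def release(self, conn):
        self._warm = conn  # hand it back warm for the next response

pool = TTSPool()
first = pool.acquire()
pool.release(first)
second = pool.acquire()
print(second is first)  # the second response skips the handshake entirely
```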
The project blew up on [Hacker News](https://news.ycombinator.com/item?id=47224295) with 234 points and 66 comments, and got picked up by BestOfShowHN. The discussion thread is worth reading — lots of back-and-forth about turn detection, the tradeoffs of different LLM providers, and whether sub-500ms is the threshold where voice AI stops feeling laggy.
What I appreciate most is that the entire state machine fits in about 30 lines of code. It’s a pure function — `process_event(state, event)` returns a new state and a list of actions. No hidden complexity, no framework lock-in. If you want to swap out Groq for another provider or replace ElevenLabs with something else, the architecture makes that straightforward.
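A toy version of that pure-function shape — the `process_event(state, event)` signature comes from the write-up, but the states, events, and actions below are made up for illustration, not Shuo's actual ones:

```python
def process_event(state, event):
    """Return (new_state, actions) -- no I/O, no hidden mutation."""
    if state == "listening" and event == "transcript_final":
        return "speaking", ["start_llm", "open_tts"]
    if state == "speaking" and event == "user_spoke":
        # The barge-in path: tear everything down at once.
        return "listening", ["cancel_llm", "close_tts", "flush_audio"]
    return state, []  # unrecognized transitions are no-ops

state, actions = process_event("speaking", "user_spoke")
print(state, actions)
```

Because all side effects are returned as an action list rather than performed inline, the transition logic can be unit-tested without mocking a single API, and swapping a provider only changes how the actions are executed, not the machine itself.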
For anyone building voice-first products, Shuo is worth studying. Not just as a tool to use, but as a blueprint for how to think about latency in streaming pipelines.
