Six months ago, running a real-time voice AI locally required an RTX 5090. Now a 2-billion-parameter model on an M3 Pro does voice, vision, and conversation at 83 tokens/sec. That’s the entire pitch behind Parlor — and it’s more impressive than it sounds.
Parlor grabbed 265 points on Show HN, landed #6 on bestofshowhn.com’s April chart, and sits at 777 GitHub stars. For a project that’s essentially “what if I duct-taped Gemma 3n E2B to Kokoro TTS and pointed a webcam at it,” that’s serious traction.
What Parlor Actually Does
You open a browser tab. Your mic and camera turn on. You talk. The AI sees what your camera sees, hears what you say, thinks for about 2.5 seconds, and talks back. No push-to-talk — Silero VAD handles voice activity detection. You can interrupt it mid-sentence (barge-in). Audio starts streaming sentence-by-sentence before the full response finishes generating.
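The post doesn’t include Parlor’s source, but the VAD step is easy to sketch. Here’s a minimal streaming setup using Silero VAD’s VADIterator helper; the threshold, chunk size, and callback wiring are assumptions for illustration, not Parlor’s actual configuration:

```python
# Minimal sketch of the voice-activity-detection step, using Silero VAD's
# streaming VADIterator. Threshold, chunk size, and the callback wiring are
# illustrative assumptions, not Parlor's actual configuration.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

vad = VADIterator(model, threshold=0.5, sampling_rate=16000)

def on_audio_chunk(chunk: torch.Tensor) -> None:
    """Feed one 512-sample (32 ms at 16 kHz) chunk of mono float32 audio."""
    event = vad(chunk, return_seconds=True)
    if event and "start" in event:
        # Speech just started: this is the barge-in trigger, so stop playback.
        print(f"speech start at {event['start']}s")
    elif event and "end" in event:
        # Utterance finished: hand the buffered audio to the model.
        print(f"speech end at {event['end']}s")
```

The start event is what makes barge-in possible: the moment new speech is detected, the server can cut off whatever the assistant is currently saying.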
The entire stack: Gemma 3n E2B for multimodal understanding, Kokoro (82M params) for TTS, a FastAPI + WebSocket backend, and a plain HTML/JS frontend. Total model weight: ~2.6 GB. That’s it.
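The backend shape is conventional FastAPI. Here is one way such a loop could look, assuming binary PCM frames from the browser and synthesized audio frames back; the endpoint path and the respond() hook are hypothetical stand-ins, not Parlor’s API:

```python
# Hypothetical sketch of the server loop: raw PCM chunks arrive over a
# WebSocket, and synthesized speech streams back one sentence at a time.
# The endpoint path, framing, and respond() hook are illustrative stand-ins.
from typing import Iterator

from fastapi import FastAPI, WebSocket

app = FastAPI()

def respond(pcm_chunk: bytes) -> Iterator[bytes]:
    """Stand-in for the real pipeline: VAD -> Gemma 3n E2B -> Kokoro TTS."""
    yield b""  # placeholder sentence audio

@app.websocket("/ws/conversation")
async def conversation(ws: WebSocket) -> None:
    await ws.accept()
    while True:
        pcm_chunk = await ws.receive_bytes()       # mic audio from the browser
        for sentence_audio in respond(pcm_chunk):  # one synthesized sentence at a time
            await ws.send_bytes(sentence_audio)    # client plays as frames arrive
```

Running both directions over a single socket is what lets barge-in and sentence-by-sentence playback coexist in a plain HTML/JS frontend.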
The Latency Breakdown
On an M3 Pro, end-to-end response time is 2.5–3.0 seconds. Speech + vision processing takes 1.8–2.2s, response generation runs ~0.3s for a 25-token reply (25 tokens at the quoted 83 tokens/sec is almost exactly 0.3s), and TTS adds another 0.3–0.7s. Not instant, but fast enough for natural conversation, and sentence-level streaming makes it feel faster than the raw numbers suggest.
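That streaming is conceptually simple: buffer tokens as they arrive and flush a sentence to TTS each time one completes. A minimal sketch, with the token source and TTS/playback hooks as hypothetical stand-ins:

```python
# Minimal sketch of sentence-level streaming: yield each sentence to TTS as
# soon as it completes instead of waiting for the full response. The token
# source and TTS/playback hooks are hypothetical stand-ins.
import re
from typing import Iterable, Iterator

def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield whole sentences as they complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Split on sentence-final punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        buffer = parts.pop()  # the tail may be an unfinished sentence
        yield from parts
    if buffer.strip():
        yield buffer  # flush whatever is left when generation ends

# Usage: playback can begin after the first sentence, well before the
# full reply has finished generating.
# for sentence in stream_sentences(generate_tokens(prompt)):
#     play(kokoro_tts(sentence))
```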
For comparison, cloud-based alternatives like ChatGPT Advanced Voice or Gemini Live hit sub-second latency — but they require internet, send your audio and video to someone else’s servers, and cost money at scale. Parlor’s latency tax buys you something those can’t: complete privacy and zero recurring cost.
Why This Matters Beyond a Demo
Parlor is explicitly a “research preview,” with rough edges everywhere. But it’s the first real proof-of-concept showing Gemma 3n E2B doing what Google designed it for: native audio + vision multimodality on consumer hardware.
The local AI movement has been mostly text. Ollama and Apple’s MLX made local text inference fast on Macs. Whisper solved local transcription. But real-time multimodal conversation (voice in, camera in, voice out) was still cloud-only territory. Parlor cracks that open.
The creator built it because the economics of hosted voice AI don’t scale. Every API call costs money; a laptop with a good-enough chip costs money once. For use cases like language learning, accessibility tools, or always-on assistants, the math tilts toward local fast.
Apache 2.0 license. Python 3.12+. A Mac with Apple Silicon, or Linux with a GPU. Clone, install, run.