Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Parlor Runs a Full Voice + Vision AI on Your MacBook — No API Key, No Cloud, 2.6 GB Total

Six months ago, running a real-time voice AI locally required an RTX 5090. Now a 2-billion-parameter model on an M3 Pro does voice, vision, and conversation at 83 tokens/sec. That’s the entire pitch behind Parlor — and it’s more impressive than it sounds.

Parlor grabbed 265 points on Show HN, landed #6 on bestofshowhn.com’s April chart, and sits at 777 GitHub stars. For a project that’s essentially “what if I duct-taped Gemma 4 E2B to Kokoro TTS and pointed a webcam at it,” that’s serious traction.

What Parlor Actually Does

You open a browser tab. Your mic and camera turn on. You talk. The AI sees what your camera sees, hears what you say, thinks for about 2.5 seconds, and talks back. No push-to-talk — Silero VAD handles voice activity detection. You can interrupt it mid-sentence (barge-in). Audio starts streaming sentence-by-sentence before the full response finishes generating.
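The sentence-by-sentence streaming idea is easy to sketch. Below is a minimal, hypothetical illustration (not Parlor's actual code): a generator that buffers tokens coming off the model and yields each complete sentence the moment it ends, so TTS can start speaking before the full response is generated.

```python
import re
from typing import Iterable, Iterator

def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer incoming tokens and yield a sentence as soon as it ends,
    so downstream TTS can start before generation finishes."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Split on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():  # flush whatever remains at end of stream
        yield buf.strip()

# Example: tokens arrive incrementally from the LLM.
tokens = ["Hel", "lo there. ", "I can ", "see your desk. ", "Nice mug"]
print(list(stream_sentences(tokens)))
# → ['Hello there.', 'I can see your desk.', 'Nice mug']
```

With ~0.3s of TTS per sentence, the first audio can begin roughly a sentence-length after generation starts rather than after the whole response.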

The entire stack: Gemma 4 E2B for multimodal understanding, Kokoro (82M params) for TTS, FastAPI + WebSocket backend, plain HTML/JS frontend. Total model weight: ~2.6 GB. That’s it.
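The barge-in behavior maps naturally onto task cancellation in an async backend like FastAPI's. This is a hypothetical sketch of the pattern, not Parlor's implementation: when VAD detects new user speech, the in-flight playback task gets cancelled.

```python
import asyncio

async def speak(sentences: list[str], spoken: list[str]) -> None:
    """Stand-in for TTS playback: 'speaks' one sentence at a time."""
    for s in sentences:
        spoken.append(s)
        await asyncio.sleep(0.2)  # simulate audio playback time

async def main() -> list[str]:
    spoken: list[str] = []
    playback = asyncio.create_task(speak(["one", "two", "three"], spoken))
    await asyncio.sleep(0.3)  # user starts talking mid-playback (VAD fires)
    playback.cancel()         # barge-in: stop speaking immediately
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return spoken

print(asyncio.run(main()))  # playback is cut off before "three"
```

Cancellation at the task level means the model can also stop generating the rest of a response the user has already talked over.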

The Latency Breakdown

On an M3 Pro, end-to-end response time is 2.5–3.0 seconds. Speech + vision processing takes 1.8–2.2s, response generation ~0.3s (25 tokens at ~83 tokens/sec), TTS another 0.3–0.7s. Not instant, but fast enough for natural conversation. Sentence-level streaming makes it feel faster than the raw numbers suggest.
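The component figures add up: a quick sanity check using only the numbers quoted in this post, which bracket the observed 2.5–3.0s range.

```python
# Per-response latency budget on an M3 Pro (figures from the post).
prefill = (1.8, 2.2)  # speech + vision processing, seconds
tts     = (0.3, 0.7)  # sentence-level TTS, seconds
gen     = 25 / 83     # 25 tokens at ~83 tokens/sec ≈ 0.30 s

low  = prefill[0] + gen + tts[0]
high = prefill[1] + gen + tts[1]
print(f"generation: {gen:.2f}s, total: {low:.1f}-{high:.1f}s")
# → generation: 0.30s, total: 2.4-3.2s
```

Note that prefill dominates the budget; shaving generation or TTS barely moves the total.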

For comparison, cloud-based alternatives like ChatGPT Advanced Voice or Gemini Live hit sub-second latency — but they require internet, send your audio and video to someone else’s servers, and cost money at scale. Parlor’s latency tax buys you something those can’t: complete privacy and zero recurring cost.

Why This Matters Beyond a Demo

Parlor is explicitly a “research preview.” Rough edges everywhere. But it’s the first real proof-of-concept showing Gemma 4 E2B doing what Google designed it for — native audio + vision multimodality on consumer hardware.

The local AI movement has been mostly text. Ollama’s MLX backend made local text inference fast on Mac. Whisper solved local transcription. But real-time multimodal conversation — voice in, camera in, voice out — was still cloud-only territory. Parlor cracks that open.

The creator built it because the economics of hosted voice AI don’t scale. Every API call costs money. A laptop with a good enough chip costs money once. For use cases like language learning, accessibility tools, or always-on assistants, the math tilts local fast.
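The break-even arithmetic is simple to sketch. The prices below are purely illustrative placeholders, not figures from the post or any provider:

```python
# Illustrative break-even: one-time hardware cost vs per-hour hosted voice AI.
# Both numbers are hypothetical placeholders for the sake of the arithmetic.
laptop_cost  = 2000.0  # one-time hardware cost, USD
api_per_hour = 0.50    # hosted voice AI cost per conversation-hour, USD

breakeven_hours = laptop_cost / api_per_hour
print(f"break-even after {breakeven_hours:.0f} conversation-hours")
# → break-even after 4000 conversation-hours
```

For an always-on assistant running several hours a day, that horizon arrives within a couple of years; for occasional use it may never arrive, which is why the math tilts local fastest for heavy, continuous workloads.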

Apache 2.0 license. Python 3.12+. Mac with Apple Silicon or Linux with GPU. Clone, install, run.

