Searching through video footage has always been painful. Whether it’s dashcam recordings, security cameras, or raw production clips, the standard approach involves either scrubbing through hours of footage manually or relying on transcription-based pipelines that miss everything visual. SentrySearch, an open-source CLI tool that appeared on Hacker News this week, takes a fundamentally different approach: it uses Google’s Gemini Embedding 2 to project raw video pixels directly into a 768-dimensional vector space shared with text, enabling sub-second semantic search across hours of footage with nothing more than a natural language query.
The project hit 241 points and 68 comments on its Show HN post, with the discussion ranging from technical praise to heated debate about surveillance implications.
How SentrySearch Actually Works
The core idea is surprisingly simple. SentrySearch splits video files into overlapping chunks (default: 30 seconds with 5-second overlap), then sends each chunk directly to Gemini Embedding 2’s API. The model processes the raw video — no transcription, no frame captioning, no OCR, no intermediate text step — and returns a 768-dimensional vector that represents the visual and temporal content of that clip.
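The overlapping-chunk scheme described above is straightforward to sketch. This is an illustrative reconstruction, not SentrySearch's actual code: with a 30-second chunk and 5-second overlap, each new chunk starts 25 seconds after the previous one.

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) spans covering a video with overlapping chunks."""
    stride = chunk_s - overlap_s  # 25 s between consecutive chunk starts
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += stride
    return spans

# A 70-second file produces chunk starts every 25 seconds:
print(chunk_spans(70.0))  # → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap means an event that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding roughly 20% more footage.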
These vectors get stored in a local ChromaDB database. When you search, your text query (“red truck running a stop sign,” for example) gets embedded into the same vector space, and cosine similarity finds the closest matching clip. The tool then automatically trims the relevant segment from the original file and saves it.
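ChromaDB handles the nearest-neighbor lookup internally, but the underlying ranking step is just cosine similarity between the query vector and each stored clip vector. A minimal pure-Python sketch of that math (the vector values and clip IDs here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_clips(query_vec, index):
    """index: list of (clip_id, vector). Returns clip IDs, best match first."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [clip_id for clip_id, _ in scored]

# Toy 3-d example: the query vector is closest to clip "a".
index = [("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0])]
print(rank_clips([0.9, 0.1, 0.0], index))  # → ['a', 'b']
```

In the real tool the vectors are 768-dimensional Gemini embeddings and the index lives in ChromaDB, but the comparison works the same way.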
The CLI workflow is three commands:
- `sentrysearch init` — prompts for your Gemini API key and validates it
- `sentrysearch index /path/to/footage` — processes and embeds all video files
- `sentrysearch search "your query here"` — finds and extracts matching clips
Before embedding, chunks are automatically downscaled to 480p at 5fps to reduce API costs. A still-frame detection feature (enabled by default) skips chunks with no meaningful visual change, avoiding wasted API calls on footage of a parked car at 3 AM.
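The article doesn't describe how still-frame detection is implemented, but a common approach is to compare consecutive frames and skip the chunk if no pair differs meaningfully. A hedged sketch using mean absolute pixel difference on grayscale frames (the threshold value is a guess, not SentrySearch's):

```python
def is_still_chunk(frames, threshold=2.0):
    """frames: list of grayscale frames, each a flat list of pixel values 0-255.
    Returns True if no consecutive frame pair exceeds `threshold` mean
    absolute pixel difference -- i.e. nothing visually changed."""
    for prev, cur in zip(frames, frames[1:]):
        mad = sum(abs(p - c) for p, c in zip(prev, cur)) / len(prev)
        if mad > threshold:
            return False
    return True

# A parked car at 3 AM: near-identical frames, chunk gets skipped.
print(is_still_chunk([[10, 10], [10, 10], [10, 11]]))  # → True
# Something moves: a large pixel delta, chunk gets embedded.
print(is_still_chunk([[0, 0], [100, 100]]))            # → False
```

A production implementation would likely operate on downscaled frames and may use a more robust metric, but the principle is the same: don't pay API tokens for footage where nothing happens.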
The Tech Behind It: Gemini Embedding 2’s Multimodal Trick
What makes SentrySearch possible is a capability that only became available in March 2026. Gemini Embedding 2 is Google’s first natively multimodal embedding model — it maps text, images, video, audio, and documents into a single unified embedding space. Previous approaches to video search typically required converting video to text first (via transcription or frame description), then embedding that text. The information loss was significant.
With Gemini Embedding 2, a 30-second clip of a cyclist weaving through traffic and a text query about “bicycle cutting between cars” land in the same mathematical neighborhood — without any text intermediary. The model supports video input up to 128 seconds per request in MP4 and MOV formats, and uses Matryoshka Representation Learning (MRL) to allow flexible dimension scaling from the default 3,072 down to 768.
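MRL-trained embeddings are constructed so that a prefix of the full vector is itself a usable embedding, which is what lets SentrySearch store 768 dimensions instead of 3,072. A sketch of how that truncation typically works (re-normalizing after truncation keeps cosine similarities well-behaved; the exact client-side handling is an assumption here):

```python
import math

def truncate_mrl(vec, dim=768):
    """Keep the first `dim` components of an MRL embedding and
    re-normalize the prefix to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy example with a 4-d "full" vector truncated to 2 dimensions:
print(truncate_mrl([3.0, 4.0, 1.0, 1.0], dim=2))  # → [0.6, 0.8]
```

Dropping from 3,072 to 768 dimensions quarters the storage per chunk in ChromaDB, at a modest cost in retrieval quality that MRL training is specifically designed to minimize.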
Video input consumes approximately 258 tokens per second at one frame per second. At $0.20 per million tokens, SentrySearch’s creator estimates indexing costs at roughly $2.50 per hour of footage with default settings — a number that drew both interest and skepticism in the Hacker News comments.
What the Hacker News Community Is Saying
The 68-comment thread reveals a split reaction.
On the technical side, responses were largely positive. One commenter called it “a really cool implementation — embeddings still often feel like magic to me.” A professional working in content intelligence confirmed that “Gemini is great for this type of use case out of the box.” Several users expressed interest in adapting the tool for their own dashcam and security camera setups.
Practical limitations came up quickly, though. The creator was transparent about search precision: vague queries like “car cuts me off” returned poor results, while adding specific details (“car with a bike rack cuts me off”) worked much better. This aligns with how embedding-based retrieval generally works — specificity matters.
The most heated discussion was about surveillance implications. A top-voted comment raised concerns about a future where “AI is watching and indexing every second of every feed,” with references to real-world systems like Fusus (which integrates private cameras into law enforcement networks) and Flock automated license plate readers. The creator acknowledged these concerns and advocated for open-weight local models that could keep footage processing entirely on-device — a direction that would eliminate the cloud dependency and privacy trade-off.
SentrySearch vs. Existing Video Search Solutions
SentrySearch enters a space that already has established players, though it occupies a different niche.
Twelve Labs is the most direct comparison in terms of capability. Their Marengo model offers multimodal video search with cross-modality retrieval (text-to-video, image-to-video, audio-to-video). But Twelve Labs is a commercial SaaS platform targeting enterprise customers — you’re paying for a managed service with proprietary models. SentrySearch is a free, open-source CLI that runs locally (aside from the Gemini API calls).
OpenAI CLIP and its derivatives have been used for image-to-text matching, but CLIP operates on individual frames rather than understanding temporal video sequences. You’d need to build significant pipeline infrastructure to get frame-level CLIP embeddings working for video search, and you’d still lose motion and temporal context.
Traditional approaches using speech-to-text plus keyword search (common in security and media workflows) only catch what’s spoken or manually tagged. They’re blind to purely visual events — which is precisely what dashcam and security footage consists of.
| Feature | SentrySearch | Twelve Labs | CLIP-based pipelines |
|---|---|---|---|
| Video understanding | Native multimodal | Native multimodal | Frame-level only |
| Temporal context | Yes (30s chunks) | Yes | No |
| Deployment | Local CLI + API | Cloud SaaS | Self-hosted |
| Cost model | Pay-per-embed (~$2.50/hr) | Subscription/usage | Compute costs |
| Open source | Yes | No | Varies |
The key trade-off is clear: SentrySearch gives you the power of multimodal video search without vendor lock-in, but you’re dependent on Google’s Gemini API for the embedding step. The creator’s stated long-term vision of supporting open-weight local models would resolve that dependency.
Limitations and Open Questions
SentrySearch is early-stage software with some notable constraints:
- Format support: Currently limited to MP4 files. The 128-second cap per API request means longer chunks need splitting.
- Search quality: Highly dependent on query specificity. Abstract or ambiguous queries return poor results — this is an embedding model limitation, not a SentrySearch bug.
- Cost at scale: $2.50/hour is reasonable for occasional use, but 24/7 security camera footage across multiple cameras would add up fast. A single camera recording continuously would cost roughly $60/day to index.
- Cloud dependency: All embedding happens via Google’s API, meaning your video data is being sent to Google’s servers. For security-sensitive footage, this is a significant concern.
- No re-ranking or filtering: The current implementation does a single-pass cosine similarity search with no metadata filtering, time-range constraints, or re-ranking pipeline.
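None of these gaps is hard to close in user code, though. For example, a time-range constraint can be layered on top of the raw similarity results as a post-filter; the result shape below is hypothetical, not SentrySearch's actual output format:

```python
def filter_by_time(results, start_s, end_s):
    """results: list of dicts with 'file', 'start', 'end', 'score' keys.
    Keep only hits whose clip span overlaps the window [start_s, end_s]."""
    return [r for r in results if r["start"] < end_s and r["end"] > start_s]

hits = [
    {"file": "cam1.mp4", "start": 0.0,   "end": 30.0,  "score": 0.91},
    {"file": "cam1.mp4", "start": 100.0, "end": 130.0, "score": 0.84},
]
# Only the second clip overlaps the 90-200 s window.
print(filter_by_time(hits, 90.0, 200.0))
```

A metadata pre-filter at query time (ChromaDB supports filtering on stored metadata) would be more efficient than post-filtering, but the post-hoc version shown here requires no changes to the index.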
FAQ
Is SentrySearch free to use?
The tool itself is open source and free. However, you need a Google Gemini API key, and the Gemini Embedding 2 API charges $0.20 per million tokens. Indexing costs roughly $2.50 per hour of video with default settings. Google does offer a free tier for testing.
Does SentrySearch only work with Tesla Sentry Mode footage?
No. Despite the name’s resemblance to Tesla’s Sentry Mode, SentrySearch works with any MP4 video files — dashcams, security cameras, screen recordings, or any other video source.
What are the best alternatives to SentrySearch?
For enterprise-grade video search, Twelve Labs offers a commercial SaaS platform with their Marengo model. For open-source frame-level search, CLIP-based pipelines are an option but lack temporal understanding. For a free, local-first tool with native video understanding, SentrySearch is currently unique in leveraging Gemini Embedding 2’s multimodal capabilities.
Can SentrySearch run entirely offline?
Not currently. The embedding step requires calls to Google’s Gemini API. The creator has expressed interest in supporting open-weight local models in the future, which would enable fully offline operation.
How accurate is the search?
Accuracy depends heavily on query specificity. Detailed, descriptive queries (“white SUV turning left at an intersection”) perform significantly better than vague ones (“car does something weird”). The tool returns the closest vector match, so there’s always a result — confidence thresholds for filtering low-quality matches are still being discussed as a potential feature.