Searching through video footage has always been painful. Whether it’s dashcam recordings, security cameras, or raw production clips, the standard approach involves either scrubbing through hours of footage manually or relying on transcription-based pipelines that miss everything visual. SentrySearch, an open-source CLI tool that appeared on Hacker News this week, takes a fundamentally different approach: it uses Google’s Gemini Embedding 2 to project raw video pixels directly into a 768-dimensional vector space shared with text, enabling sub-second semantic search across hours of footage with nothing more than a natural language query.
The project hit 241 points and 68 comments on its Show HN post, with the discussion ranging from technical praise to heated debate about surveillance implications.
How SentrySearch Actually Works
The core idea is surprisingly simple. SentrySearch splits video files into overlapping chunks (default: 30 seconds with 5-second overlap), then sends each chunk directly to Gemini Embedding 2’s API. The model processes the raw video — no transcription, no frame captioning, no OCR, no intermediate text step — and returns a 768-dimensional vector that represents the visual and temporal content of that clip.
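The overlapping-chunk scheme described above is straightforward to sketch. This is an illustrative reconstruction, not SentrySearch's actual code: with a 30-second chunk and 5-second overlap, each new chunk starts 25 seconds after the previous one.

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) spans covering a video with overlapping chunks."""
    stride = chunk_s - overlap_s  # 25 s between consecutive chunk starts
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += stride
    return spans

# A 70-second file produces chunk starts every 25 seconds:
print(chunk_spans(70.0))  # → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap means an event that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding roughly 20% more footage.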
These vectors get stored in a local ChromaDB database. When you search, your text query (“red truck running a stop sign,” for example) gets embedded into the same vector space, and cosine similarity finds the closest matching clip. The tool then automatically trims the relevant segment from the original file and saves it.
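ChromaDB handles the nearest-neighbor lookup internally, but the underlying ranking step is just cosine similarity between the query vector and each stored clip vector. A minimal pure-Python sketch of that math (the vector values and clip IDs here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_clips(query_vec, index):
    """index: list of (clip_id, vector). Returns clip IDs, best match first."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [clip_id for clip_id, _ in scored]

# Toy 3-d example: the query vector is closest to clip "a".
index = [("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0])]
print(rank_clips([0.9, 0.1, 0.0], index))  # → ['a', 'b']
```

In the real tool the vectors are 768-dimensional Gemini embeddings and the index lives in ChromaDB, but the comparison works the same way.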
The CLI workflow is three commands:
- `sentrysearch init` — prompts for your Gemini API key and validates it
- `sentrysearch index /path/to/footage` — processes and embeds all video files
- `sentrysearch search "your query here"` — finds and extracts matching clips
Before embedding, chunks are automatically downscaled to 480p at 5fps to reduce API costs. A still-frame detection feature (enabled by default) skips chunks with no meaningful visual change, avoiding wasted API calls on footage of a parked car at 3 AM.
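The article doesn't describe how still-frame detection is implemented, but a common approach is to compare consecutive frames and skip the chunk if no pair differs meaningfully. A hedged sketch using mean absolute pixel difference on grayscale frames (the threshold value is a guess, not SentrySearch's):

```python
def is_still_chunk(frames, threshold=2.0):
    """frames: list of grayscale frames, each a flat list of pixel values 0-255.
    Returns True if no consecutive frame pair exceeds `threshold` mean
    absolute pixel difference -- i.e. nothing visually changed."""
    for prev, cur in zip(frames, frames[1:]):
        mad = sum(abs(p - c) for p, c in zip(prev, cur)) / len(prev)
        if mad > threshold:
            return False
    return True

# A parked car at 3 AM: near-identical frames, chunk gets skipped.
print(is_still_chunk([[10, 10], [10, 10], [10, 11]]))  # → True
# Something moves: a large pixel delta, chunk gets embedded.
print(is_still_chunk([[0, 0], [100, 100]]))            # → False
```

A production implementation would likely operate on downscaled frames and may use a more robust metric, but the principle is the same: don't pay API tokens for footage where nothing happens.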
The Tech Behind It: Gemini Embedding 2’s Multimodal Trick
What makes SentrySearch possible is a capability that only became available in March 2026. Gemini Embedding 2 is Google’s first natively multimodal embedding model — it maps text, images, video, audio, and documents into a single unified embedding space. Previous approaches to video search typically required converting video to text first (via transcription or frame description), then embedding that text. The information loss was significant.
With Gemini Embedding 2, a 30-second clip of a cyclist weaving through traffic and a text query about “bicycle cutting between cars” land in the same mathematical neighborhood — without any text intermediary. The model supports video input up to 128 seconds per request in MP4 and MOV formats, and uses Matryoshka Representation Learning (MRL) to allow flexible dimension scaling from the default 3,072 down to 768.
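MRL-trained embeddings are constructed so that a prefix of the full vector is itself a usable embedding, which is what lets SentrySearch store 768 dimensions instead of 3,072. A sketch of how that truncation typically works (re-normalizing after truncation keeps cosine similarities well-behaved; the exact client-side handling is an assumption here):

```python
import math

def truncate_mrl(vec, dim=768):
    """Keep the first `dim` components of an MRL embedding and
    re-normalize the prefix to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy example with a 4-d "full" vector truncated to 2 dimensions:
print(truncate_mrl([3.0, 4.0, 1.0, 1.0], dim=2))  # → [0.6, 0.8]
```

Dropping from 3,072 to 768 dimensions quarters the storage per chunk in ChromaDB, at a modest cost in retrieval quality that MRL training is specifically designed to minimize.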
Video input consumes approximately 258 tokens per second at one frame per second. At $0.20 per million tokens, SentrySearch’s creator estimates indexing costs at roughly $2.50 per hour of footage with default settings — a number that drew both interest and skepticism in the Hacker News comments.
What the Hacker News Community Is Saying
The 68-comment thread reveals a split reaction.
On the technical side, responses were largely positive. One commenter called it “a really cool implementation — embeddings still often feel like magic to me.” A professional working in content intelligence confirmed that “Gemini is great for this type of use case out of the box.” Several users expressed interest in adapting the tool for their own dashcam and security camera setups.
Practical limitations came up quickly, though. The creator was transparent about search precision: vague queries like “car cuts me off” returned poor results, while adding specific details (“car with a bike rack cuts me off”) worked much better. This aligns with how embedding-based retrieval generally works — specificity matters.
The most heated discussion was about surveillance implications. A top-voted comment raised concerns about a future where “AI is watching and indexing every second of every feed,” with references to real-world systems like Fusus (which integrates private cameras into law enforcement networks) and Flock automated license plate readers. The creator acknowledged these concerns and advocated for open-weight local models that could keep footage processing entirely on-device — a direction that would eliminate the cloud dependency and privacy trade-off.
SentrySearch vs. Existing Video Search Solutions
SentrySearch enters a space that already has established players, though it occupies a different niche.
Twelve Labs is the most direct comparison in terms of capability. Their Marengo model offers multimodal video search with cross-modality retrieval (text-to-video, image-to-video, audio-to-video). But Twelve Labs is a commercial SaaS platform targeting enterprise customers — you’re paying for a managed service with proprietary models. SentrySearch is a free, open-source CLI that runs locally (aside from the Gemini API calls).
OpenAI CLIP and its derivatives have been used for image-to-text matching, but CLIP operates on individual frames rather than understanding temporal video sequences. You’d need to build significant pipeline infrastructure to get frame-level CLIP embeddings working for video search, and you’d still lose motion and temporal context.
Traditional approaches using speech-to-text plus keyword search (common in security and media workflows) only catch what’s spoken or manually tagged. They’re blind to purely visual events — which is precisely what dashcam and security footage consists of.
| Feature | SentrySearch | Twelve Labs | CLIP-based pipelines |
|---|---|---|---|
| Video understanding | Native multimodal | Native multimodal | Frame-level only |
| Temporal context | Yes (30s chunks) | Yes | No |
| Deployment | Local CLI + API | Cloud SaaS | Self-hosted |
| Cost model | Pay-per-embed (~$2.50/hr) | Subscription/usage | Compute costs |
| Open source | Yes | No | Varies |
The key trade-off is clear: SentrySearch gives you the power of multimodal video search without vendor lock-in, but you’re dependent on Google’s Gemini API for the embedding step. The creator’s stated long-term vision of supporting open-weight local models would resolve that dependency.
Limitations and Open Questions
SentrySearch is early-stage software with some notable constraints:
- Format support: Currently limited to MP4 files. The 128-second cap per API request means longer chunks need splitting.
- Search quality: Highly dependent on query specificity. Abstract or ambiguous queries return poor results — this is an embedding model limitation, not a SentrySearch bug.
- Cost at scale: $2.50/hour is reasonable for occasional use, but 24/7 security camera footage across multiple cameras would add up fast. A single camera recording continuously would cost roughly $60/day to index.
- Cloud dependency: All embedding happens via Google’s API, meaning your video data is being sent to Google’s servers. For security-sensitive footage, this is a significant concern.
- No re-ranking or filtering: The current implementation does a single-pass cosine similarity search with no metadata filtering, time-range constraints, or re-ranking pipeline.
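None of these gaps is hard to close in user code, though. For example, a time-range constraint can be layered on top of the raw similarity results as a post-filter; the result shape below is hypothetical, not SentrySearch's actual output format:

```python
def filter_by_time(results, start_s, end_s):
    """results: list of dicts with 'file', 'start', 'end', 'score' keys.
    Keep only hits whose clip span overlaps the window [start_s, end_s]."""
    return [r for r in results if r["start"] < end_s and r["end"] > start_s]

hits = [
    {"file": "cam1.mp4", "start": 0.0,   "end": 30.0,  "score": 0.91},
    {"file": "cam1.mp4", "start": 100.0, "end": 130.0, "score": 0.84},
]
# Only the second clip overlaps the 90-200 s window.
print(filter_by_time(hits, 90.0, 200.0))
```

A metadata pre-filter at query time (ChromaDB supports filtering on stored metadata) would be more efficient than post-filtering, but the post-hoc version shown here requires no changes to the index.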
FAQ
Is SentrySearch free to use?
The tool itself is open source and free. However, you need a Google Gemini API key, and the Gemini Embedding 2 API charges $0.20 per million tokens. Indexing costs roughly $2.50 per hour of video with default settings. Google does offer a free tier for testing.
Does SentrySearch only work with Tesla Sentry Mode footage?
No. Despite the name’s resemblance to Tesla’s Sentry Mode, SentrySearch works with any MP4 video files — dashcams, security cameras, screen recordings, or any other video source.
What are the best alternatives to SentrySearch?
For enterprise-grade video search, Twelve Labs offers a commercial SaaS platform with their Marengo model. For open-source frame-level search, CLIP-based pipelines are an option but lack temporal understanding. For a free, local-first tool with native video understanding, SentrySearch is currently unique in leveraging Gemini Embedding 2’s multimodal capabilities.
Can SentrySearch run entirely offline?
Not currently. The embedding step requires calls to Google’s Gemini API. The creator has expressed interest in supporting open-weight local models in the future, which would enable fully offline operation.
How accurate is the search?
Accuracy depends heavily on query specificity. Detailed, descriptive queries (“white SUV turning left at an intersection”) perform significantly better than vague ones (“car does something weird”). The tool returns the closest vector match, so there’s always a result — confidence thresholds for filtering low-quality matches are still being discussed as a potential feature.