Google announced Gemini Omni at I/O 2026, a new model series that combines Gemini’s reasoning capabilities with native video generation. The first release, Gemini Omni Flash, accepts image, audio, video, and text input and outputs video that is grounded in real-world knowledge and can be easily edited.
## What’s actually new
Most video generation models today are text-to-video or text-plus-image-to-video. Gemini Omni accepts all four input modalities (image, audio, video, and text) and outputs video. The “grounded in real-world knowledge” angle leans on Gemini’s training corpus: the pitch is that the model already knows the rules of physics, the look of real cities, and the way speech maps to mouth movement, so those facts don’t need to be specified in the prompt.
## The editing pitch
“Easily edited” is the headline difference versus Sora 2, Veo 3.1, and Kling. Generated video has historically been one-shot: making a change means re-rolling the clip, which burns expensive compute. Gemini Omni positions itself as edit-friendly, though Google hasn’t released specifics on how granular the editing controls actually are.
## Why it matters
This is Google’s direct response to a fragmented AI video market (Sora 2, Veo 3.1, Krea 2, Kling, Runway). Bundling video generation into the Gemini model lineup means existing Gemini API users can generate video without picking a separate provider. Pricing and rollout details should follow over the next week.
