
For most enterprises, creating a 90-second training video or product explainer has never been easy. It means a well-planned brief, an internal film crew or an outside vendor, a shoot, an edit, and a round of revisions. Change one line of on-screen text after legal review and the whole cycle starts over. Cost and long lead times mean that not much internal video ever gets made.
That’s the equation Google is aiming to rewrite with its new model, Gemini Omni Flash. The "omni" family, after debuting for consumers at I/O 2026, is now reaching developers and enterprise customers via API. Google describes the family’s ambition as creating anything "from any input," starting with video. But the headline capability isn’t just cleaner text-to-video generation. It is the ability to edit a finished clip through conversation.
When the model launched in May, VentureBeat’s enterprise analysis flagged the catch: with no programmatic interface, Omni was a consumer product, not a production tool. This API rollout changes that. It puts conversational editing in front of the marketing and learning-and-development teams who create the most video in an organization.
The pitch: A five-tool pipeline collapsed into a single conversation
Today, many teams assemble AI video piece by piece, with an LLM for the script, a text-to-image model, an image-to-video model, a separate lip-sync tool, and a voice generator, each with its own contract, billing, and data path.
The enterprise logic of Omni is consolidation: a single model that takes text, images, and video and returns a finished clip with synced audio.
That simplification is what decision makers should weigh first. Consolidating multiple point tools into one model means fewer vendors and a single place to monitor output and enforce data-handling rules. For an organization that has avoided generative video because stitching the tools together wasn’t worth the overhead, the equation changes.
With conversational editing, each instruction builds on the last, so a marketer can retouch a product shot, reframe it, or change a wardrobe without regenerating the clip from scratch and losing the parts that already work. It’s the difference between booking a reshoot and sending a note.
Multimodal context and a physics engine for brand assets
Omni accepts much more than text prompts. As well as words describing what you want, you can feed it several reference images and existing video clips, and it carries those specifics into the result. Hand it a photo of a particular object, ask the model to place that object in a scene, and it reproduces the color and rough shape of the real thing instead of inventing a generic stand-in. Although the match may not be pixel-perfect, it is close enough to be recognizable. That context-driven control is what makes the feature commercially interesting: a product photo, a brand logo, or a specific location can be supplied as an input rather than described in a prompt and hoped for.
Two of Google’s four highlighted strengths relate directly to enterprise work. The first is a world model, the system’s understanding of how the physical world behaves. Add light rain and puddles to an existing shot and it produces reflections of people and objects in the wet pavement, the kind of physical consistency that separates convincing footage from obvious AI video.
The second is text and logo insertion. Point it at a scene full of signage and you can rewrite those signs in another language, or for the brand of your choice, and even insert a company logo. The results aren’t flawless: In testing, sign tracking wasn’t always perfect in complex scenes and some text reverted to the original language between frames. For training videos that require on-screen labels, or ads that need to place a logo in the scene, it’s a capability worth taking a close look at, and a reminder that the output still needs a human review before shipping.
The Interactions API, and where the limits still bite
Under the hood, it runs on Google’s new Interactions API, a stateful interface built for multi-turn tasks rather than open-ended chat. Each turn builds on the previous video and its context, allowing edits to stack together coherently. Developers can chain a series of generations: generate a clip, edit a cat into a puma kitten, restyle the video as 8-bit retro and then as watercolor, and keep each version as a branch to return to later.
The constraints are real and worth budgeting for. According to the model’s published model card, clips are currently capped at 10 seconds. To make something longer, you generate segments and stitch them together. Uploaded footage can also be edited, as long as it runs 10 seconds or less and the user has the rights to it. Google’s own model card is clear that maintaining consistency across edits and rendering accurate text remain open issues.
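As a rough illustration of what the 10-second cap means for planning, the sketch below splits a target runtime into equal segments that each fit under the limit. The function name and the equal-split strategy are assumptions made for illustration; nothing here calls Google’s API, and stitching the generated segments together would happen in a separate editing step.

```python
import math

MAX_CLIP_SECONDS = 10  # per-clip cap cited from the model card


def plan_segments(total_seconds: float, max_len: float = MAX_CLIP_SECONDS) -> list[float]:
    """Split a target runtime into equal segments no longer than max_len.

    Hypothetical planning helper; it only computes segment lengths.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    count = math.ceil(total_seconds / max_len)
    return [total_seconds / count] * count


print(plan_segments(90))  # nine 10-second segments for a 90-second explainer
print(plan_segments(25))  # three equal segments, each under the cap
```

Equal splitting is only one option. In practice, segment boundaries would more likely follow scene changes in the script.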
Guardrails, watermarking, and the line Google won’t cross
For CISOs, the provenance work matters as much as the demo reel. Every Omni clip carries Google’s SynthID watermark, Google is expanding C2PA content credentials to its generative tools, and it has launched an AI content detection API that flags AI-generated media from both Google and other vendors.
Google has also drawn a deliberate line. The model won’t take a still photo and an audio clip of a person and lip-sync them into speech, an obvious move to limit deepfakes. However, it will take a recording of someone talking and translate it into another language, which is a useful way to localize global training content. For regulated enterprises, those guardrails and provenance signals are features rather than friction.
The numbers: Cheap, 720p only, and (for now) in first place
Pricing arrived alongside the API, and it’s aggressive. Omni Flash costs $0.10 per second of 720p video, which works out to about a dollar for a ten-second clip. That matches Veo 3.1 Fast at the same resolution, costs twice as much as Veo 3.1 Lite, and undercuts the standard Veo 3.1 by three-quarters.
| Per second (USD) | Gemini Omni Flash | Veo 3.1 Lite | Veo 3.1 Fast | Veo 3.1 |
|---|---|---|---|---|
| 720p | $0.10 | $0.05 | $0.10 | $0.40 |
| 1080p | N/A | $0.08 | $0.12 | $0.40 |
| 4K | N/A | N/A | $0.30 | $0.60 |
However, the table also highlights the catch. Omni Flash only produces 720p. There’s no 1080p or 4K option, while the Veo tiers scale up to 4K. For internal training and most social video, 720p is fine. For premium brands working on larger screens, this is a real ceiling, and that’s why Veo 3.1 still has a job to do.
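To make the comparison concrete, the per-second prices can be put into a small lookup that returns the cost of a clip, or nothing when a tier doesn’t offer the resolution. This is a sketch: the dictionary layout and function name are illustrative, and the figures are simply the table’s values expressed in cents to avoid floating-point drift.

```python
from typing import Optional

# Per-second prices in US cents, copied from the pricing table in this article.
# None marks a resolution the tier does not offer (N/A in the table).
PRICE_CENTS_PER_SECOND = {
    "Gemini Omni Flash": {"720p": 10, "1080p": None, "4K": None},
    "Veo 3.1 Lite": {"720p": 5, "1080p": 8, "4K": None},
    "Veo 3.1 Fast": {"720p": 10, "1080p": 12, "4K": 30},
    "Veo 3.1": {"720p": 40, "1080p": 40, "4K": 60},
}


def clip_cost_usd(model: str, resolution: str, seconds: int) -> Optional[float]:
    """Return the cost in dollars of one clip, or None if unavailable."""
    rate = PRICE_CENTS_PER_SECOND[model][resolution]
    if rate is None:
        return None
    return rate * seconds / 100


for name in PRICE_CENTS_PER_SECOND:
    print(name, clip_cost_usd(name, "720p", 10))
```

For a ten-second 720p clip this gives $1.00 on Omni Flash against $4.00 on standard Veo 3.1, the three-quarters gap described earlier.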
Clips run 3 to 10 seconds at native 720p, in landscape (16:9) or portrait (9:16). As reference input, the model accepts up to seven images and up to three video clips of three seconds or less. It does not yet take audio as input, although it generates audio alongside the video. The output is standard MP4, and each clip comes with SynthID watermarking and C2PA credentials.
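Teams wiring this into a pipeline may want to check requests against those limits before spending a generation. The sketch below encodes the stated limits as a pre-flight check. `ClipRequest` and `validate` are hypothetical names, not part of any Google SDK, and the limits are read as upper bounds.

```python
from dataclasses import dataclass, field


@dataclass
class ClipRequest:
    """Hypothetical request description; not an SDK type."""
    duration_seconds: float
    aspect_ratio: str
    reference_images: int = 0
    reference_video_seconds: list[float] = field(default_factory=list)
    has_audio_input: bool = False


def validate(request: ClipRequest) -> list[str]:
    """Return a list of limit violations; an empty list means none found."""
    problems = []
    if not 3 <= request.duration_seconds <= 10:
        problems.append("duration must be 3 to 10 seconds")
    if request.aspect_ratio not in ("16:9", "9:16"):
        problems.append("aspect ratio must be 16:9 or 9:16")
    if request.reference_images > 7:
        problems.append("at most seven reference images")
    if len(request.reference_video_seconds) > 3:
        problems.append("at most three reference video clips")
    if any(s > 3 for s in request.reference_video_seconds):
        problems.append("reference clips must be three seconds or less")
    if request.has_audio_input:
        problems.append("audio input is not supported yet")
    return problems


print(validate(ClipRequest(8, "16:9", reference_images=2)))
print(validate(ClipRequest(12, "1:1", has_audio_input=True)))
```

The first request passes. The second is flagged for duration, aspect ratio, and audio input.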
In terms of quality, the early signal is strong. In LMArena’s Text-to-Video Arena, a leaderboard where people vote on the output of competing models head-to-head, Omni Flash ranked number one with a score of 1527.
What this means for the budget, and what’s still missing
With real pricing, the payback story becomes concrete. Every conversational edit is a new generation you pay for, so an editing-heavy session still adds up, at about a dollar for each ten-second pass at 720p. What changes with the stateful model isn’t the cost of an edit, it’s the number of wasted generations: because context carries over, those generations go toward refining a take that mostly works, rather than restarting from an empty prompt and hoping the next attempt succeeds.
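The session arithmetic is simple enough to sketch. The snippet assumes, as described above, that every pass is billed as a full new generation at the 720p rate of $0.10 per second. The function name is illustrative.

```python
OMNI_FLASH_CENTS_PER_SECOND = 10  # 720p rate quoted in this article


def session_cost_usd(passes: int, clip_seconds: int = 10) -> float:
    """Cost of an editing session where every pass regenerates the clip.

    Assumes each pass, initial or edit, is billed for the full clip length.
    """
    return passes * clip_seconds * OMNI_FLASH_CENTS_PER_SECOND / 100


print(session_cost_usd(1))  # a single ten-second generation
print(session_cost_usd(6))  # one generation plus five conversational edits
```

Six passes on a ten-second clip come to $6.00 under these assumptions. The saving comes from needing fewer passes, not from cheaper ones.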
Omni is not alone in this area. Veo 3.1 remains Google’s production-grade option when you need higher resolution, and rivals from ByteDance, Alibaba and OpenAI are all chasing the same budgets. What Omni adds is conversational editing: the ability to treat a video as a living document rather than a one-shot render.