Visual imitation learning: Guidde trains AI agents on human 'expert video' instead of documentation

For years, the "last mile" of digital transformation has been littered with forgotten PDFs and neglected training manuals.

Organizations spend millions on sophisticated software like SAP or Salesforce, only to have employees struggle with basic navigation. Now, as the era of agentic AI arrives, companies face a double-edged sword: They must teach human employees to collaborate with AI, while also teaching AI agents to navigate the labyrinthine interfaces of the modern enterprise.

One idea that’s gaining momentum among AI-forward businesses: taking a screen recording of someone walking through an enterprise task – whether it’s creating a new ticket or processing an invoice – and training AI to replicate the flow from the screen capture. Just this week, a startup called Standard Intelligence went viral on X by showing an early demo of its open-ended version for the physical and digital worlds.

But the truth is that there are players already tackling this problem for the enterprise. Case in point: Guidde. The Israeli startup, born during the video-centric years of the COVID-19 pandemic, today announced a Series B funding round of over $50 million led by PSG Equity to address precisely this knowledge infrastructure crisis.

Instead of feeding a static PDF manual to an agent, Guidde provides high-fidelity "video ground truth": a rich stream of data captured from real human experts as they navigate complex software.

The investment signals a shift in how the tech industry views documentation – not as a static byproduct of work, but as the vital telemetry needed to train the next generation of autonomous digital agents.

Technology: from video capture to world models

At its core, Guidde is an AI Digital Adoption Platform (ADAP). However, its technical success lies in what happens behind the scenes during recording.

Guidde isn’t just recording pixels; it is capturing every click, scroll and latent interaction with the HTML page: subtle pauses, specific scroll depths, and the corrections a human makes when the system lags. This telemetry transforms the raw video into a vision-language-action (VLA) training set.

Meanwhile, the platform’s Magic Redaction automatically obfuscates sensitive data like passwords or credit card numbers during capture, ensuring content remains secure and HIPAA-aligned.

"Every time you click a button, you drag-and-drop, you scroll, you type, we collect interactions… all of it, we clean it up – there’s no private information in it," Yoav Einav, co-founder and CEO of Guidde, explained in an exclusive interview with VentureBeat.

Under the hood, the platform captures underlying metadata and DOM (Document Object Model) changes synchronized with video frames. The differentiator is telemetry hidden beneath the surface.
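One way to picture "DOM changes synchronized with video frames" is as a timestamp join between a frame timeline and an interaction log. The sketch below is only an illustration under stated assumptions: the `Interaction` fields and the `align_to_frames` helper are hypothetical names, not Guidde's actual schema or pipeline, and frame timestamps are assumed to be sorted in ascending order.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Interaction:
    """Hypothetical record of one captured user action."""
    timestamp_ms: int
    action: str      # e.g. "click", "scroll", "type"
    selector: str    # DOM element the action targeted


def align_to_frames(
    frame_times_ms: list[int], events: list[Interaction]
) -> list[tuple[int, Interaction]]:
    """Pair each interaction with the index of the latest frame at or before it.

    Assumes frame_times_ms is sorted ascending.
    """
    pairs = []
    for event in events:
        idx = bisect_right(frame_times_ms, event.timestamp_ms) - 1
        if idx >= 0:  # drop events logged before the first frame
            pairs.append((idx, event))
    return pairs
```

Each resulting (frame index, interaction) pair is the kind of observation-plus-action example a vision-language-action training set is built from; a production system would also need to handle clock drift and dropped frames.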

This rich metadata creates a "digital world model" of enterprise software. And because each enterprise uses its own unique mix of apps and processes, Guidde is building a data moat that allows enterprise agents to reason through legacy UI with the same spatial awareness as a human, ensuring that automation actually works in a production environment rather than just a lab demo.

To a human, it’s a tutorial. To an AI agent, it’s a high-fidelity map of the interface. This allows agents to "look" at and reason through complex UIs the way humans do, solving the "last mile" problem of automation, where agents previously failed for lack of enterprise-specific, in-situ usage context.

In a sense, Guidde is building a Waymo-like "self-driving car" for computer use.

Product: Three pillars of Guidde-ance

The platform has evolved into three distinct products designed to suit the maturity of the organization:

Guidde Create: An engine for subject matter experts to transform workflows into documentation in minutes.
Guidde Broadcast: A personalized recommendation engine – often compared to Netflix – that provides answers inside the tools people actually use. It knows who the user is and what department they belong to, so it can surface relevant content when needed.
Guidde Discover: The newly launched "agentic" pillar. Just as Waze maps roads by observing drivers, Discover maps software workflows by observing how employees work. It understands the workflow, creates content, and automatically updates it when the UI changes.

Training humans to use AI – and AI to use humans

The least obvious aspect of Guidde’s development is its dual-purpose mission. "We are the only platform that trains both humans and agents," Einav said.

As companies roll out AI tools like Microsoft 365 Copilot or ServiceNow Agent, an efficiency gap opens up. One of Guidde’s biggest customers revealed that it was paying more than $1 million per year for a sophisticated AI tool, yet "nobody knows how to use them because they had like a 30 minute training session, and then that’s it." Guidde bridges this gap by providing "bite-size" video tutorials within the workflow.

Plus, these videos train the AI agents themselves. Foundation models like Gemini or GPT-4 often struggle with specific enterprise workflows because they were not trained on the highly specific, internal, "non-vanilla" workflows found in private enterprise systems. Guidde provides the "departure point," the "metadata," and the "x,y coordinates of the button" that an agent needs to complete a task without getting stuck.
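To make the three ingredients in that quote concrete, here is a minimal sketch of what a recorded step might carry and how an agent could reuse the coordinates. Everything in it is an assumption for illustration: `RecordedStep`, its fields, and `rescale_click` are hypothetical names, and linear rescaling is a simplification that real responsive layouts often break, which is one reason DOM metadata matters alongside raw coordinates.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedStep:
    """Hypothetical shape of one step in a captured workflow."""
    departure_url: str                    # the "departure point" for the step
    button_xy: tuple[int, int]            # pixel coordinates at capture time
    recorded_viewport: tuple[int, int]    # (width, height) during capture
    metadata: dict = field(default_factory=dict)


def rescale_click(step: RecordedStep, viewport: tuple[int, int]) -> tuple[int, int]:
    """Scale the recorded click position linearly to the agent's viewport size."""
    x, y = step.button_xy
    rec_w, rec_h = step.recorded_viewport
    cur_w, cur_h = viewport
    return round(x * cur_w / rec_w), round(y * cur_h / rec_h)
```

A step recorded at the center of a 1920×1080 window maps to the center of a 1280×720 window under this scheme; an agent would still want to confirm the target element through the recorded metadata before clicking.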

Multimodal benefits

To maintain this level of accuracy, Guidde uses a multimodal infrastructure. The system is not dependent on any one model; instead, it uses a "fleet" of models that evaluate each other.

Google Gemini: Generally used for visual tasks like analyzing PDFs or PowerPoint files.
Anthropic Claude: Used for writing storylines and narrative scripts.
Feedback loops: When a user edits a video, that data is fed back into the model to prevent the same mistakes from being made in future captures.
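The article does not describe how the models in the fleet evaluate each other, so the following is just one plausible arrangement, sketched with plain Python callables standing in for model calls. `pick_best`, `Generator` and `Judge` are hypothetical names; no real model API is invoked.

```python
from typing import Callable

Generator = Callable[[str], str]        # prompt -> draft
Judge = Callable[[str, str], float]     # (prompt, draft) -> score


def pick_best(
    prompt: str, generators: dict[str, Generator], judges: dict[str, Judge]
) -> tuple[str, str]:
    """Let every model draft an answer, score each draft with the *other*
    models' judges, and return (model name, draft) with the highest mean score."""
    best_name, best_draft, best_score = "", "", float("-inf")
    for name, generate in generators.items():
        draft = generate(prompt)
        # A model never grades its own draft.
        scores = [judge(prompt, draft) for j_name, judge in judges.items() if j_name != name]
        if not scores:
            continue
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_name, best_draft, best_score = name, draft, mean
    return best_name, best_draft
```

Excluding a model's own judge from its scores is the design choice that makes this cross-evaluation rather than self-grading; the user-edit feedback loop described above would sit outside this function, as a later adjustment to the generators or judges.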

This approach allows Guidde to replace a legacy stack of six or seven disconnected tools – such as Loom for capture, Adobe Premiere for editing, ElevenLabs for text-to-speech, and Synthesia for avatars – with a single, AI-native platform. "We basically pack everything for you," Einav says, "and automate the entire process based on your brand guidelines."

Video-first origin story

The genesis of Guidde is rooted in a frustration familiar to any product leader. Before founding the company, Einav and co-founder Dan Sahar spent years mastering video traffic at Qwilt, a company they started in 2010 to analyze how people watch Netflix and Disney+.

When COVID-19 hit, they saw a great opportunity to apply that video expertise in the workplace. They found that short video explainers could increase free-to-paid account conversions by 30%, but the hassle of creating them wasn’t sustainable.

In the interview, Einav recalled the "tedious work" of the old world: "My team in Israel was creating the content, someone in the US with an American accent was narrating, someone in the marketing team was writing the script… and someone in the enablement team was editing." This fragmented workflow meant that a video took two to three weeks to produce. "And then two weeks later, the product changes, and you need to make it afresh," Einav added.

Guidde was created to condense this cycle into seconds. By automating the "Magic Capture" workflow, the platform instantly generates a structured narrative script and professional AI voiceover. It removes the editing barrier, turning subject matter experts into "training powerhouses."

Licensing and market effects

Guidde’s pricing structure reflects its transition from a utility to a core part of enterprise infrastructure:

Free: $0 (up to 25 videos, web-app support).
Pro: $18/creator/month (unlimited videos, branded kits).
Business: $39/creator/month (unlimited text-to-voice, analytics).
Enterprise: Custom pricing (multilingual translation, SSO, Magic Redaction).
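Because the paid tiers are priced per creator per month, a team's annual bill is simple multiplication. The sketch below uses only the list prices quoted in the tiers above and assumes twelve monthly payments with no discounts or taxes; `annual_cost` is an illustrative helper, and the Enterprise tier is left out because its pricing is custom.

```python
# Per-creator monthly list prices in USD, as quoted in the tiers above.
MONTHLY_PRICE_PER_CREATOR = {"free": 0, "pro": 18, "business": 39}


def annual_cost(tier: str, creators: int) -> int:
    """Annual list cost in USD, assuming 12 monthly payments and no discounts."""
    return MONTHLY_PRICE_PER_CREATOR[tier] * creators * 12
```

For a ten-creator team this works out to $2,160 a year on Pro and $4,680 on Business.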

The impact of the platform is already visible in the numbers: a 41% reduction in video production time and 34% fewer inbound support tickets.

For customers like Emerson, this translates to 40-60% faster guide creation. Support teams, in particular, are finding that they can handle 80% of their ticket volume with agents – but only if those agents have useful content.

"The agent is useless without the material," Einav cautions, noting that most enterprise documentation is either years old or missing entirely.

Early community and industry reception

Guidde already claims 4,500 enterprise customers and is looking to expand this number with its new round of funding. Support and operations leaders have been vocal about the platform’s ease of use. Christopher Cummings, Vice President of Client Experience at DockerNetwork, highlighted its ability to provide "instant, personalized video answers to customer questions."

Meanwhile, Wren Cotron, director of customer support, said that "Once you’ve got the branding set up the way you want it, you can really zoom in on this thing."

PSG Managing Director Ronen Nir summarized the investment thesis: "Guidde is solving one of the biggest barriers to successful AI adoption: knowledge infrastructure."

Why does it matter now?

The shift from text-only LLMs to agentic video intelligence is the defining trend of 2026. Guidde’s Series B signals that the "truth" for enterprise agents will come from raw video observation, not static documentation.

By capturing how work is done across tens of millions of workflows, Guidde is creating a dataset that few others have.

As Einav said: "It begins with humans and moves toward full autonomy over time." For the modern enterprise, the map is no longer a static document – it is a living, breathing video intelligence layer that guides both the workforce and the agents supporting them.
