AI agents fail 63% of the time on complex tasks. Patronus AI says its new 'living' training worlds can fix that.

Patronus AI, an artificial intelligence evaluation startup backed by $20 million from investors including Lightspeed Venture Partners and Datadog, on Tuesday unveiled a new training architecture that it says represents a fundamental shift in how AI agents learn to perform complex tasks.

The technology, which the company calls a "generative simulator," creates an adaptive simulation environment that constantly generates new challenges, dynamically updates rules, and evaluates the agent's performance as it learns, all in real time. The approach marks a departure from the static benchmarks that have long served as the industry standard for measuring AI capabilities but have increasingly come under criticism for failing to predict real-world performance.

"Traditional benchmarks measure discrete capabilities, but they miss the interruptions, context switches, and layered decision making that define real work," said Anand Kannappan, chief executive and co-founder of Petronas AI, in an exclusive interview with VentureBeat. "For agents to perform at human levels, they need to learn like humans through dynamic experience and continuous feedback."

This announcement comes at a critical moment for the AI industry. AI agents are reshaping software development, from writing code to carrying out complex instructions. Yet LLM-based agents are prone to errors and often perform poorly on complex, multi-step tasks. Research published earlier this year found that an agent with just a 1% error rate per step faces a roughly 63% probability of failure by the hundredth step, a sobering statistic for enterprises looking to deploy autonomous AI systems at scale.
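The compounding arithmetic behind that statistic is straightforward: if each step succeeds independently with probability 0.99, the chance that all 100 steps succeed is 0.99^100 ≈ 0.37, leaving roughly a 63% chance of at least one failure. A quick sketch (illustrative, not from the cited research):

```python
# Compounding per-step error: if each step succeeds with probability
# (1 - per_step_error), the chance all `steps` succeed is that value
# raised to the power of `steps`.
def failure_probability(per_step_error: float, steps: int) -> float:
    """Probability that at least one step fails across `steps` steps."""
    return 1 - (1 - per_step_error) ** steps

print(round(failure_probability(0.01, 100), 3))  # 0.634
```

The same math explains why per-step reliability matters so much more for long-horizon agents than for single-shot model calls.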

Why static AI benchmarks are failing – and what’s next

Patronus AI's approach addresses what the company describes as a growing mismatch between how AI systems are evaluated and how they actually perform in production. The company argues that traditional benchmarks function like standardized tests: They measure specific abilities at a fixed point in time but struggle to capture the messy, unpredictable nature of real work.

The new generative simulator architecture flips this model. Instead of presenting agents with a fixed set of questions, the system generates tasks, environmental conditions, and evaluation procedures on the fly, then adapts based on how the agent behaves.

"Over the past year, we have seen a shift away from traditional static benchmarks towards more interactive learning bases," Rebecca Qian, chief technology officer and co-founder of Petronas AI, told VentureBeat. "This is partly due to the innovation we have seen from model developers – the shift towards reinforcement learning, post-training and continuous learning, and away from supervised instruction tuning. This means that the gap between training and assessment has disappeared. Have become benchmark environments."

The technology is based on reinforcement learning (RL), an approach in which AI systems learn through trial and error, receiving rewards for correct actions and penalties for mistakes. RL can help improve agents, but it usually requires developers to extensively rewrite their code. That friction discourages adoption, even though the data these agents generate can significantly boost performance through RL training.
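As a toy illustration of that trial-and-error loop, here is a minimal epsilon-greedy learner that discovers which of two actions pays off better. This is a generic RL teaching example, not Patronus AI's implementation, and nothing like LLM-scale training:

```python
import random

# Minimal trial-and-error loop: an epsilon-greedy agent estimates the
# value of two actions from observed rewards, mostly exploiting its
# current best guess but occasionally exploring at random.
def train(steps: int = 1000, eps: float = 0.1, seed: int = 0) -> list:
    rng = random.Random(seed)
    values = [0.0, 0.0]        # running estimate of each action's reward
    counts = [0, 0]
    true_payoff = [0.2, 0.8]   # hidden reward probabilities per action
    for _ in range(steps):
        if rng.random() < eps:                      # explore
            a = rng.randrange(2)
        else:                                       # exploit best estimate
            a = values.index(max(values))
        reward = 1.0 if rng.random() < true_payoff[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean
    return values

print(train())  # action 1's estimated value should end up higher
```

The "penalty" side of RL would simply be a negative reward; the update rule is unchanged.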

Patronus AI also introduced a new concept it calls "open recursive self-correction," or ORSI – environments where agents can continuously improve through interaction and feedback without requiring a full retraining cycle between attempts. The company positions it as critical infrastructure for developing AI systems capable of learning continuously rather than remaining static at one point in time.

Inside the ‘Goldilocks Zone’: How adaptive AI training finds the sweet spot

At the core of the generative simulator is what Patronus AI calls a "course adjuster," a component that analyzes the agent's behavior and dynamically modifies the difficulty and nature of the training scenarios. The approach takes inspiration from how effective human teachers adapt their instruction based on student performance.

Qian explained the approach using an analogy: "You can think of it as a teacher-student model, where we are training the model and the professor continuously adapts the curriculum."

This adaptive approach addresses a problem Kannappan described as finding the "Goldilocks zone" in training data: ensuring that examples are neither too easy nor too hard for a given model to learn from effectively.

"The important thing is not whether you can train on a data set, but whether you can train on a high-quality data set that is consistent with your model—one that it can actually learn from." Kannappan said. "We want to make sure that the examples are neither too difficult nor too easy to model."

The company says initial results show meaningful improvements in agent performance: training in Patronus AI environments increased task completion rates by 10% to 20% on real-world tasks including software engineering, customer service, and financial analysis.

The AI cheating problem: How ‘moving target’ environments prevent reward hacking

One of the most persistent challenges in training AI agents through reinforcement learning is a phenomenon researchers call "reward hacking," where systems learn to exploit flaws in their training environment rather than actually solving problems. Famous examples include early agents that learned to hide in corners of video games rather than actually play them.

Generative simulators solve this by making the training environment a moving target.

"Reward hacking is basically a problem when systems are stable. It’s like students learning how to cheat in an exam," Kian said. "But when we are constantly evolving the environment, we can actually see parts of the system that need to adapt and evolve. Static benchmarks are fixed targets; Generator simulator environments are moving targets."

Patronus AI reports 15x revenue growth as enterprise demand for agent training increases

Patronus AI positions generative simulators as the foundation of a new product line it calls "RL environments": purpose-built training grounds for foundation model labs and for enterprises building agents for specific domains. The company says the offering represents a strategic expansion beyond its core focus on evaluation tools.

"Our revenue has grown 15x this year, largely due to the high-quality environments we have developed that have been shown to be highly learnable by a variety of frontier models," Kannappan said.

The CEO declined to disclose full revenue figures, but said the new product has allowed the company to "move up the stack in terms of where we sell and who we sell to." The company's platform is used by many Fortune 500 enterprises and leading AI companies around the world.

Why can’t OpenAI, Anthropic, and Google build everything in-house?

A central question facing Patronus AI is why the deep-pocketed labs developing frontier models – organizations like OpenAI, Anthropic and Google DeepMind – would license training infrastructure rather than build it themselves.

Kannappan acknowledged that these companies are "making significant investments in the environment," but argued that the breadth of domains requiring specialized training creates natural opportunities for third-party providers.

"They want to make agents better on many different domains, whether it’s coding or using tools or navigating a browser or workflows in finance, health care, energy and education," He said. "It is very difficult for any one company to solve all those different operational problems."

The competitive landscape is intensifying. Microsoft recently released Agent Lightning, an open-source framework that brings reinforcement learning to any AI agent without code rewrites. Nvidia's NeMo Gym provides modular RL infrastructure for developing agentic AI systems. Meta researchers released DreamGym in November, a framework that simulates RL environments and dynamically adjusts task difficulty as agents improve.

‘Environment is the new oil’: Patronus AI’s bold bet on the future of AI training

Looking ahead, Patronus AI frames its mission broadly: the company wants to "environmentalize all the world's data," transforming human workflows into structured systems from which AI can learn.

"We think everything should be an environment—internally, we joke that the environment is the new oil," Kannappan said. "Reinforcement learning is just a training method, but creating an environment is what really matters."

Qian described the opportunity in broad terms: "This is a completely new area of research, which doesn't happen every day. Generative simulation is inspired by early research in robotics and embodied agents. This has been an unrealized dream for decades, and we are only now able to achieve these ideas because of the capabilities of today's models."

The company launched in September 2023 with a focus on evaluation, helping enterprises identify hallucinations and security issues in AI outputs. That mission now extends upstream into training itself. Patronus AI argues that the traditional separation between evaluation and training is breaking down, and that whoever controls the environments where AI agents learn will shape their capabilities.

"We’re really at this tipping point, this inflection point, where what we do now will impact the world for generations to come," Kian said.

It remains to be seen whether generative simulators can deliver on that promise. The company's 15x revenue growth suggests that enterprise customers are hungry for solutions, but deep-pocketed players from Microsoft to Meta are racing to solve the same fundamental problem. If the past two years have taught the industry anything, it's that in AI, the future has a habit of arriving ahead of schedule.
