Alibaba's AgentEvolver lifts model performance in tool use by ~30% using synthetic, auto-generated tasks


Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environment. The framework, called AgentEvolver, uses the knowledge and reasoning capabilities of large language models for autonomous learning, addressing the high cost and manual effort typically required to collect task-specific datasets.

Experiments show that compared to traditional reinforcement learning-based frameworks, AgentEvolver is more efficient in exploring its environment, makes better use of data, and adapts faster to the application environment. For enterprises, this is important because it lowers the barrier to training agents for particular applications, making powerful, custom AI assistants more accessible to a broader range of organizations.

High cost of training AI agents

Reinforcement learning (RL) has become a dominant paradigm for training LLMs to function as agents that can interact with digital environments and learn from feedback. However, developing agents with RL faces fundamental challenges. First, collecting the required training datasets is often extremely expensive, requiring significant manual labor to create examples of tasks, especially in new or proprietary software environments where no off-the-shelf datasets are available.

Second, the RL techniques commonly used for LLMs require the model to undergo many trial-and-error attempts to learn effectively. This process is computationally expensive and inefficient. As a result, training capable LLM agents through RL remains laborious and expensive, limiting their deployment in custom enterprise settings.

How does AgentEvolver work?

The main idea behind AgentEvolver is to give models more autonomy in their learning process. The researchers describe it as a “self-evolving agent system” designed to “achieve autonomous and efficient capability development through environmental interaction.” It uses the reasoning power of LLMs to create self-training loops, allowing the agent to continuously improve by interacting directly with its target environment without the need for predefined actions or reward functions.

“We envision an agent system where the LLM actively guides exploration, task creation, and performance refinement,” the researchers wrote in their paper.

The self-evolution process is driven by three main mechanisms that work together.

The first is self-questioning, where the agent explores its environment to discover the limits of its actions and identify useful situations. It’s like a new user clicking around an application to see what’s possible. Based on this exploration, the agent generates its own set of diverse tasks that correspond to the user’s general preferences. This reduces the need for handcrafted datasets and allows the agent and its tasks to co-evolve, making it increasingly capable of tackling more complex challenges.

According to Yunpeng Zhai, an Alibaba researcher and co-author of the paper, who spoke to VentureBeat, the self-questioning mechanism effectively changes the model from “data consumer to data producer,” dramatically reducing the time and cost required to deploy an agent in a proprietary environment.
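The self-questioning loop can be illustrated with a minimal sketch. Everything here is hypothetical: the environment, the class and function names, and the use of string templates in place of the LLM that AgentEvolver would actually use to explore the environment and write tasks.

```python
import random

# Hypothetical sketch of self-questioning: the agent first probes an
# environment to discover which actions exist, then synthesizes candidate
# training tasks from what it found. In AgentEvolver an LLM drives both
# steps; simple templates stand in for the model here.

class ToyEnvironment:
    """A stand-in application exposing a few callable actions."""
    def __init__(self):
        self.actions = {"search_contacts", "send_email", "create_event"}

    def list_actions(self):
        return sorted(self.actions)

def explore(env):
    """Exploration phase: record which actions the environment supports."""
    return env.list_actions()

def synthesize_tasks(discovered_actions, n=3, seed=0):
    """Task-generation phase: turn discovered actions into candidate tasks.
    An LLM would write richer, preference-aligned tasks; templates merely
    illustrate the loop."""
    rng = random.Random(seed)
    templates = ["Use {a} to help the user.", "Test edge cases of {a}."]
    return [rng.choice(templates).format(a=rng.choice(discovered_actions))
            for _ in range(n)]

env = ToyEnvironment()
actions = explore(env)
tasks = synthesize_tasks(actions)
print(actions)
print(tasks)
```

The key design point the sketch captures is that the training tasks are a function of what exploration discovered, so the agent and its curriculum can co-evolve as the article describes.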

The second mechanism is self-navigating, which improves exploration efficiency by reusing and generalizing past experiences. AgentEvolver extracts insights from both successful and unsuccessful attempts and uses them to guide future actions. For example, if an agent tries to use an API function that does not exist in an application, it registers this as an experience and learns to verify that functions exist before trying to use them in the future.
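The article's API example can be sketched as a small experience memory that is consulted before acting. The class and function names are illustrative, not AgentEvolver's real interface, and string matching stands in for the LLM-driven generalization of experiences.

```python
# Hypothetical sketch of self-navigating: insights distilled from past
# successes and failures are stored and consulted before acting, so the
# agent avoids repeating known mistakes (e.g. calling an API that does
# not exist).

class ExperienceMemory:
    def __init__(self):
        self._insights = []  # list of (condition, advice) pairs

    def record(self, condition, advice):
        self._insights.append((condition, advice))

    def advice_for(self, action):
        """Return any stored advice relevant to the proposed action."""
        return [advice for cond, advice in self._insights if cond in action]

memory = ExperienceMemory()

# A failed attempt from an earlier episode becomes a reusable insight.
memory.record("delete_user", "verify the function exists before calling it")

def act(action, memory):
    hints = memory.advice_for(action)
    if hints:
        return f"checking first ({hints[0]})"
    return f"executing {action}"

print(act("delete_user(id=7)", memory))  # guided by the past failure
print(act("list_users()", memory))       # no relevant insight stored
```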

The third mechanism, self-attributing, increases learning efficiency by providing more detailed feedback. Instead of issuing just a final success-or-failure signal (a common practice in RL that results in sparse rewards), this mechanism uses an LLM to assess the contribution of each individual action in a multi-step task. It retrospectively determines whether each step contributed positively or negatively to the final outcome, providing the agent with nuanced feedback that accelerates learning.

This is important for regulated industries, where how an agent solves a problem is as important as the outcome. “Instead of just rewarding a student for the final answer, we also evaluate the clarity and correctness of each step in their reasoning,” Zhai explained. This improves transparency and encourages agents to adopt more robust and auditable problem-solving patterns.
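The per-step credit idea described above can be sketched as follows. This is a toy illustration under stated assumptions: a keyword heuristic stands in for the LLM judge, and the blending rule and its `step_weight` parameter are inventions for the example, not the paper's actual formula.

```python
# Hypothetical sketch of self-attributing: instead of one sparse
# end-of-episode reward, each step in a trajectory receives its own
# credit signal, blended with the final outcome into a dense reward.

def judge_step(step):
    """Stand-in for an LLM judge: +1 if the step looks helpful, -1 if not."""
    return -1.0 if "error" in step else 1.0

def attribute_rewards(trajectory, final_success, step_weight=0.5):
    """Blend per-step judgments with the final outcome (toy blending rule)."""
    outcome = 1.0 if final_success else -1.0
    return [step_weight * judge_step(s) + (1 - step_weight) * outcome
            for s in trajectory]

traj = ["open app", "call missing API -> error", "retry with valid API"]
rewards = attribute_rewards(traj, final_success=True)
print(rewards)  # the failed middle step gets less credit than the others
```

Even in this toy form, the failed step is distinguishable from the useful ones, which is exactly the signal a sparse final reward cannot provide.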

The researchers say, “By shifting the training initiative from human-engineered pipelines to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way for scalable, cost-effective, and continuously improving intelligent systems.”

The team has also developed a practical, end-to-end training framework that integrates these three mechanisms. A key part of this framework is the context manager, a component that controls the agent’s memory and interaction history. While today’s benchmarks test a limited number of tools, real enterprise environments may contain thousands of APIs.

Zhai acknowledges that this is a core challenge for the field, but notes that AgentEvolver was designed to scale. “Retrieval over extremely large action spaces will always introduce computational challenges, but AgentEvolver’s architecture provides a clear path toward scalable tool reasoning in enterprise settings,” he said.
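One common way to handle the large action spaces Zhai mentions is to retrieve a short list of candidate tools per query rather than putting the whole catalog in context. The sketch below is a generic illustration of that pattern, not AgentEvolver's implementation; a token-overlap score stands in for the embedding-based retrieval a production system would likely use.

```python
# Hypothetical sketch of tool retrieval over a large action space: with
# thousands of APIs, the full tool list cannot fit in the model's context,
# so a retriever narrows it to the k most relevant candidates per query.

def score(query, description):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(description.lower().split()))

def retrieve_tools(query, catalog, k=2):
    """Return the k tool names whose descriptions best match the query."""
    ranked = sorted(catalog, key=lambda name: score(query, catalog[name]),
                    reverse=True)
    return ranked[:k]

catalog = {
    "create_invoice": "create a new invoice for a customer",
    "refund_payment": "refund a payment to a customer",
    "list_inventory": "list items currently in inventory",
}
print(retrieve_tools("refund the customer payment", catalog))
```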

A more efficient path to agent training

To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance against a baseline model trained with GRPO, a popular RL technique used to develop reasoning models such as DeepSeek-R1.

The results showed that integrating all three mechanisms into AgentEvolver yielded substantial performance gains. For the 7B model, the average score improved by 29.4%, and for the 14B model, it increased by 27.8% over the baseline. The framework consistently enhanced the models’ reasoning and task-completion capabilities on both benchmarks. The most significant improvement came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the problem of data scarcity.

The experiments also showed that AgentEvolver can efficiently synthesize large amounts of high-quality training data. The tasks generated by the self-questioning module proved diverse enough to achieve good training efficiency even with a small amount of data.

For enterprises, this provides a way to create custom applications and agents for internal workflows, reducing the need for manual data annotation. By providing high-level goals and allowing agents to generate their own training experiences, organizations can develop custom AI assistants more simply and cost-effectively.

The researchers concluded, “This combination of algorithmic design and engineering practicality establishes AgentEvolver as both a research vehicle and a reusable foundation for building adaptive, tool-augmented agents.”

Looking ahead, the ultimate goal is ambitious. “A truly ‘unique model’ that can enter any software environment and master it overnight is certainly the holy grail of agentic AI,” Zhai said. “We see AgentEvolver as a necessary step in that direction.” While that future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are paving the way.


