New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

Agents built on top of today’s models often break with simple changes, such as a new library or a workflow modification, and require a human engineer to fix them. This is one of the most persistent challenges in deploying AI for the enterprise: creating agents that can adapt to dynamic environments without constant hand-holding. Although today’s models are powerful, they are still largely static.

To address this, researchers at the University of California, Santa Barbara have developed Group Evolving Agents (GEA), a new framework that enables groups of AI agents to evolve together, share experiences, and reuse each other’s innovations to improve autonomously over time.

In experiments on complex coding and software engineering tasks, GEA significantly outperformed existing self-improvement frameworks. Perhaps most notably for enterprise decision makers, the system autonomously evolved agents that matched or exceeded the performance of frameworks painstakingly designed by human experts.

Limits of ‘Lone Wolf’ Evolution

Most current agentic AI systems rely on fixed architectures designed by human engineers. These systems often struggle to grow beyond the limitations imposed by their initial designs.

To solve this, researchers have long sought to create self-evolving agents that can autonomously modify their code and structure to overcome their initial limitations. This ability is essential to handle open-ended environments where the agent must constantly seek new solutions.

However, the current approach to self-evolution has a major structural flaw. As the researchers write in their paper, most systems are inspired by biological evolution and are designed around “individual-centric” processes. These methods generally use a tree-structured approach: single “parent” agents are selected to produce offspring, creating separate evolutionary branches that remain strictly isolated from each other.

This separation creates a silo effect. An agent in one branch cannot access the data, tools, or workflows discovered by an agent in a parallel branch. If a lineage fails to be selected for the next generation, any valuable discoveries made by its agents, such as a new debugging tool or a more efficient testing workflow, are lost with it.

In their paper, the researchers question the need to adhere to this biological metaphor. “AI agents are not biological individuals,” they argue. “Why should their evolution remain constrained by biological patterns?”

The Collective Intelligence of Evolving Agent Groups

GEA changes the paradigm by treating a group of agents, rather than a single individual, as the fundamental unit of evolution.

The process begins by selecting a set of parent agents from the existing population. To balance proven performance against fresh ideas, GEA selects these agents based on a combined score of performance (their ability to solve tasks) and novelty (how different an agent’s capabilities are from the rest of the group).
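
The article doesn’t spell out the scoring formula, but the idea is straightforward to sketch. The Python below is a minimal illustration, assuming a Jaccard-distance novelty measure and a simple weighted sum; `Agent`, `novelty`, and `select_parents` are our own names, not GEA’s API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    performance: float                               # e.g. benchmark success rate in [0, 1]
    capabilities: set = field(default_factory=set)   # tool/workflow identifiers

def novelty(agent: Agent, population: list) -> float:
    """Mean Jaccard distance between this agent's capabilities and its peers'."""
    others = [a for a in population if a is not agent]
    if not others:
        return 1.0
    total = 0.0
    for other in others:
        union = agent.capabilities | other.capabilities
        shared = agent.capabilities & other.capabilities
        total += (1 - len(shared) / len(union)) if union else 0.0
    return total / len(others)

def select_parents(population: list, k: int, alpha: float = 0.5) -> list:
    """Rank by a weighted blend of performance and novelty; keep the top k."""
    score = lambda a: alpha * a.performance + (1 - alpha) * novelty(a, population)
    return sorted(population, key=score, reverse=True)[:k]

# Toy example: the novel-but-weaker agent can still make the cut.
pop = [
    Agent("a1", 0.90, {"debugger", "test_runner"}),
    Agent("a2", 0.85, {"debugger", "test_runner"}),
    Agent("a3", 0.60, {"fuzzer", "profiler"}),
]
print([a.name for a in select_parents(pop, k=2)])   # ['a3', 'a1']
```

In this toy run, the weaker but more distinctive agent edges out a near-clone of the top performer, which is exactly the behavior a blended performance-plus-novelty score is designed to produce.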

Unlike traditional systems, where an agent learns only from its direct parents, GEA creates a shared pool of collective experience. This pool contains the evolutionary traces of every member of the parent group, including code modifications, successful task solutions, and tool-invocation history. Each agent in the group gains access to this collective history, allowing it to learn from the successes and mistakes of its peers.
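
The article names three kinds of shared traces: code modifications, task solutions, and tool-invocation history. A minimal sketch of such a pool might look like this (the schema and names are assumptions, not the released implementation):

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ExperienceRecord:
    agent_id: str
    kind: Literal["code_patch", "task_solution", "tool_call"]
    payload: str          # e.g. a diff, a solved-issue summary, an invocation log
    succeeded: bool

class ExperiencePool:
    """Shared, append-only store of evolutionary traces, readable by the whole group."""

    def __init__(self) -> None:
        self._records: list[ExperienceRecord] = []

    def add(self, record: ExperienceRecord) -> None:
        self._records.append(record)

    def query(self, kind: Optional[str] = None,
              successes_only: bool = False) -> list[ExperienceRecord]:
        """Let any agent read its peers' history, optionally filtered."""
        return [
            r for r in self._records
            if (kind is None or r.kind == kind)
            and (not successes_only or r.succeeded)
        ]
```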

A “reflection module,” powered by a large language model, analyzes this collective history to identify group-wide patterns. For example, if one agent discovers a high-performance debugging tool while another perfects a testing workflow, the system extracts both insights. Based on this analysis, it generates high-level “evolution instructions” that guide the formation of the child group. This ensures that the next generation inherits the combined strengths of all its parents rather than the traits of a single lineage.
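
Conceptually, the reflection step is a summarization pass over the shared pool. A hedged sketch, reusing the `ExperiencePool` above and assuming any prompt-in, text-out LLM callable:

```python
def reflect(pool: ExperiencePool, group_size: int, llm) -> list[str]:
    """Distill group-wide patterns into one evolution instruction per child agent.

    `llm` is any callable mapping a prompt string to a completion string;
    wiring it to a real model provider is left to the reader.
    """
    wins = pool.query(successes_only=True)
    digest = "\n".join(f"[{r.agent_id} | {r.kind}] {r.payload[:200]}" for r in wins)
    prompt = (
        "Below are successful experiences from a group of coding agents:\n"
        f"{digest}\n\n"
        f"Write {group_size} distinct, high-level instructions for the next "
        "generation, combining the strengths observed across ALL agents."
    )
    lines = [ln.strip("- ").strip() for ln in llm(prompt).splitlines() if ln.strip()]
    return lines[:group_size]
```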

However, this hive-mind approach works best when success is objectively measurable, as in coding tasks. “For less deterministic domains (for example, creative generation), evaluation signals are weaker,” paper co-authors Zhaotian Weng and Xin Eric Wang told VentureBeat in written comments. “Blindly sharing outputs and experiences can lead to low-quality experiences that act as noise. This suggests the need for stronger experience-filtering mechanisms” for subjective tasks.

GEA in action

The researchers tested GEA against the current state-of-the-art self-evolving baseline, the Darwin Gödel Machine (DGM), on two rigorous benchmarks. The results showed a large jump in efficacy without increasing the number of agents used.

This collaborative approach also makes the system more robust to failure. In their experiments, the researchers deliberately broke agents by manually inserting bugs into their implementations. GEA fixed these critical bugs in an average of 1.4 iterations, while the baseline took 5. The system effectively leverages “healthy” group members to diagnose and repair the affected ones.
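
The article doesn’t describe the repair loop’s internals, but the mechanism it names, healthy members diagnosing a broken peer, suggests a loop along these lines; `propose_patch`, `apply_patch`, and `run_tests` are hypothetical hooks, not part of any released API:

```python
def repair(broken_agent, healthy_peers, pool, max_iterations: int = 10) -> int:
    """Healthy group members take turns diagnosing a broken peer.

    Returns the number of iterations spent before the broken agent's
    test suite passes again.
    """
    for iteration in range(1, max_iterations + 1):
        for peer in healthy_peers:
            # The peer consults the shared experience pool to diagnose the bug.
            patch = peer.propose_patch(broken_agent.source, pool)
            broken_agent.apply_patch(patch)
            if broken_agent.run_tests():
                return iteration
    raise RuntimeError("repair budget exhausted")
```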

On SWE-bench Verified, a benchmark built from real GitHub issues, including bug fixes and feature requests, GEA achieved a 71.0% success rate compared to the baseline’s 56.7%. That is a significant jump in autonomous engineering throughput, meaning the agents are far more capable of handling real-world software maintenance. Similarly, on Polyglot, which tests code generation across different programming languages, GEA scored 88.3% versus the baseline’s 68.3%, indicating greater adaptability across technology stacks.

For enterprise R&D teams, the most important finding is that GEA lets AI design agents as effectively as human engineers do. On SWE-bench, GEA’s 71.0% success rate effectively matches the performance of OpenHands, the top human-designed open-source framework. On Polyglot, GEA significantly outperformed the popular coding assistant Aider, which scored 52.0%. This suggests that organizations may eventually rely less on large engineering teams to adapt agent architectures to change, since the agents can learn those adaptations autonomously.

This efficiency extends to cost management. “GEA is cleanly a two-stage system: (1) agent evolution, then (2) inference/deployment,” the researchers said. “After evolution, you deploy a single evolved agent… so the enterprise inference cost is essentially unchanged compared to a standard single-agent setup.”

GEA’s success stemmed largely from its ability to consolidate innovations. The researchers tracked specific innovations that agents invented during the evolutionary process. In the baseline, valuable tools often appeared in different branches but failed to propagate because those lineages went extinct. In GEA, the shared experience pool ensured that such tools were picked up by the best-performing agents. The top GEA agent integrated traits from 17 unique ancestors (representing 28% of the population), while the best baseline agent integrated traits from only 9. In effect, GEA produces a single agent that embodies the combined best practices of the entire group.

"The GEA-inspired workflow in production would allow agents to first attempt some independent fixes when a failure occurs," Researchers explained this self-healing ability. "A reflection agent (usually driven by a strong foundation model) can summarize the results… and guide more comprehensive system updates."

Furthermore, the improvements discovered by GEA are not tied to any specific underlying model. Agents evolved with one model, such as Claude, retained their performance gains even when the underlying engine was swapped to another model family, such as GPT-5.1 or o3-mini. This transferability gives enterprises the flexibility to switch model providers without losing the architectural improvements their agents have learned.
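
This works because what GEA evolves is the agent’s scaffold (tools, workflows, prompts) rather than the model itself, so the backend reduces to a swappable parameter. A toy illustration (class fields and model names are placeholders, not GEA’s API):

```python
from dataclasses import dataclass, field

@dataclass
class EvolvedAgent:
    """The evolved artifact is the scaffold, not the weights."""
    tools: list = field(default_factory=list)   # evolved tool suite
    workflow: str = ""                           # evolved orchestration prompt/logic
    model: str = "claude-sonnet"                 # swappable backend

agent = EvolvedAgent(tools=["smart_debugger", "patch_validator"],
                     workflow="plan -> edit -> test")
agent.model = "o3-mini"   # switch providers; evolved tools and workflow carry over intact
```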

For industries with strict compliance requirements, the idea of self-modifying code may seem risky. Addressing this, the authors said: “We expect enterprise deployments to include non-evolving guardrails, such as sandboxed execution, policy constraints, and validation layers.”

While the researchers plan to release official code soon, developers can already start implementing the GEA architecture conceptually on top of existing agent frameworks. The system requires three major additions to a standard agent stack: an “experience store” to hold evolutionary traces, a “reflection module” to analyze group-wide patterns, and an “update module” that lets an agent modify its own code based on those insights.
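
Until the official code ships, a skeleton of those three modules might look like the following; all class and method names here are our own guesses at the structure, not the released API:

```python
class ExperienceStore:
    """Append-only log of evolutionary traces: patches, solved tasks, tool calls."""
    def __init__(self):
        self.traces = []
    def record(self, trace: dict) -> None:
        self.traces.append(trace)

class ReflectionModule:
    """Asks a strong LLM to turn group-wide traces into evolution instructions."""
    def __init__(self, llm):
        self.llm = llm   # any callable: prompt string -> completion string
    def analyze(self, store: ExperienceStore) -> str:
        return self.llm(
            "Identify group-wide patterns in these traces and propose "
            f"concrete improvements:\n{store.traces}"
        )

class UpdateModule:
    """Rewrites the agent's own source code according to an instruction."""
    def __init__(self, llm):
        self.llm = llm
    def apply(self, agent_source: str, instruction: str) -> str:
        return self.llm(
            f"Rewrite this agent code per the instruction.\n"
            f"Instruction: {instruction}\n\nCode:\n{agent_source}"
        )
```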

Looking ahead, the framework could democratize advanced agent development. “A promising direction is a hybrid evolution pipeline,” the researchers said, “where smaller models explore early to accumulate diverse experiences, and stronger models later use those experiences to guide evolution.”
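
A hedged sketch of that pipeline, building on the module skeleton above; `run_agent` is a hypothetical execution hook:

```python
def run_agent(agent_source: str, task: str, llm) -> dict:
    """Hypothetical hook: run the agent on a task and return a trace."""
    return {"task": task, "output": llm(f"{agent_source}\nTask: {task}")}

def hybrid_evolve(agent_source: str, store: ExperienceStore, tasks,
                  small_llm, large_llm, generations: int = 6) -> str:
    """Small model explores early to pile up cheap, diverse experiences;
    a strong model later distills them into guided updates."""
    for gen in range(generations):
        backend = small_llm if gen < generations // 2 else large_llm
        for task in tasks:
            store.record(run_agent(agent_source, task, backend))
        if gen >= generations // 2:
            instruction = ReflectionModule(large_llm).analyze(store)
            agent_source = UpdateModule(large_llm).apply(agent_source, instruction)
    return agent_source
```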


