
A major challenge in deploying autonomous agents is building systems that can adapt to changes in their environment without the need to retrain the underlying large language models (LLMs).
A new framework, Memento-Skills, developed by researchers from several universities, addresses this barrier by giving agents the ability to develop their own skills. "It complements the continuous-learning capability of existing offerings in the market, such as OpenClaw and Claude Code," Jun Wang, co-author of the paper, told VentureBeat.
Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.
For enterprise teams running agents in production, this matters. The alternatives – fine-tuning model weights or manually building skills – carry significant operational overhead and data requirements. Memento-Skills sidesteps both.
Challenges of creating self-evolving agents
Self-evolving agents are important because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain constant, limiting it to the knowledge it encoded during training and whatever fits in its immediate context window.
Giving the model external memory scaffolding helps it improve without the expensive and slow process of retraining. However, current approaches to agent optimization largely rely on manually designed skills to handle new tasks. While some automated skill-learning methods exist, they mostly produce text-only guides rather than executable artifacts. Other approaches only log single-task trajectories that do not transfer across different tasks.
Furthermore, when these agents attempt to acquire knowledge relevant to a new task, they typically rely on semantic-similarity routers, such as standard dense embeddings. But high semantic overlap does not guarantee practical usefulness: an agent relying on standard RAG can retrieve a "password reset" script for a "refund processing" query simply because the two documents share enterprise terminology.
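A toy illustration of this retrieval trap: here, simple bag-of-words overlap stands in for embedding similarity, and the skill descriptions and query are invented for the example. The lexically closest skill wins even though it cannot solve the task.

```python
# Bag-of-words overlap as a crude stand-in for similarity-based retrieval.
# Skill names, descriptions, and the query are illustrative only.

def overlap(a: str, b: str) -> int:
    """Count shared words between two texts (a toy similarity score)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

skills = {
    "password_reset": "reset enterprise account credentials for employee portal",
    "process_refund": "issue payment reversal",
}
query = "refund request for enterprise employee account"

# The refund skill's description shares no vocabulary with the query,
# so similarity alone routes the task to the wrong, "closest" skill.
best = max(skills, key=lambda k: overlap(query, skills[k]))
print(best)  # password_reset — lexically closest, behaviorally wrong
```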
"Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as Markdown documents or code snippets, similarity alone cannot select the most effective skills," Wang said.
How Memento-Skills stores and updates skills
To solve the limitations of current agentic systems, researchers created Memento-Skills. The paper describes the system as “a generalist, continuously learnable LLM agent system that acts as an agent-designing agent.” Rather than keeping a passive log of past conversations, Memento-Skills creates a set of skills that acts as a persistent, evolving external memory.
These skills are stored as structured Markdown files and serve as the agent’s evolving knowledge base. Each reusable skill artifact is made up of three main elements. It contains declarative specifications that explain what the skill is and how it should be used. It contains special instructions and hints that guide the logic of the language model. And it contains executable code and helper scripts that the agent runs to actually solve the task.
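To make that three-part structure concrete, here is a hypothetical sketch of what such a skill artifact could look like; the skill name, section headings, and helper script are illustrative assumptions, not taken from the paper:

````markdown
# Skill: extract_tables_from_pdf   (hypothetical example)

## Specification
Extracts tabular data from a PDF and returns it as CSV.
Use when a task requires reading structured data out of a PDF document.

## Instructions
- Prefer the embedded text layer over OCR when one exists.
- If a page yields no rows, report the page number rather than guessing.

## Code
```python
# illustrative helper script the agent would execute
def extract_tables(path: str) -> list[list[str]]:
    ...
```
````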
Memento-Skills achieves continuous learning through a "reflective read-and-write" mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a dedicated skill router to retrieve the most behaviorally relevant skill – not just the most semantically similar one – and executes it.
After the agent executes the skill and receives feedback, the system reflects on the results to close the learning loop. Instead of simply appending a log of what happened, it actively alters its memory: if the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts, directly updating the code or prompts to patch the specific failure mode, or creating an entirely new skill when needed.
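The retrieve-execute-reflect-rewrite loop described above can be sketched in a few lines of Python. Everything here is an illustrative assumption, not the paper's actual implementation: the `eval`-based executor is a toy, and the `rewrite` callback stands in for the LLM orchestrator that patches failing artifacts.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    code: str            # executable helper (here: a Python expression)
    score: float = 0.0   # router's utility estimate for this skill

@dataclass
class Library:
    skills: list = field(default_factory=list)

def execute(skill: Skill, task):
    """Run the skill's code against the task; return (success, output)."""
    try:
        return True, eval(skill.code, {}, {"task": task})
    except Exception as exc:
        return False, str(exc)

def reflective_step(task, expected, library, rewrite):
    """Retrieve the top-scored skill, execute it, and on failure rewrite
    the artifact in place instead of merely logging the trace."""
    skill = max(library.skills, key=lambda s: s.score)
    ok, out = execute(skill, task)
    if ok and out == expected:
        skill.score += 1.0           # reinforce a skill that worked
        return out
    # Reflection: the orchestrator (here, a toy callback) patches the skill
    skill.code = rewrite(skill.code, task)
    ok, out = execute(skill, task)
    return out

lib = Library([Skill("double", "task + task", score=1.0)])
# "3" + "3" concatenates strings, so the stored skill fails on numeric
# doubling; the rewrite callback patches the artifact, then it succeeds.
answer = reflective_step("3", 6, lib, rewrite=lambda code, t: "int(task) * 2")
print(answer)  # 6, after the skill was rewritten in place
```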
Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. "The real value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” Wang said. “Therefore, reinforcement learning provides a more appropriate framework, as it enables the agent to evaluate and select skills based on long-term utility."
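The idea of routing on execution feedback rather than text overlap can be sketched as a simple bandit-style value update; this is an illustrative toy, not the paper's one-step offline RL algorithm, and all names are invented for the example.

```python
from collections import defaultdict

class SkillRouter:
    """Toy router that learns per-task skill utilities from rewards."""

    def __init__(self, lr: float = 0.5):
        self.q = defaultdict(float)   # utility estimate per (task_type, skill)
        self.lr = lr

    def select(self, task_type: str, candidates: list) -> str:
        # Pick the skill with the highest learned utility, not the one
        # whose description is most similar to the task text.
        return max(candidates, key=lambda s: self.q[(task_type, s)])

    def update(self, task_type: str, skill: str, reward: float) -> None:
        # One-step update from execution feedback (success=1, failure=0).
        key = (task_type, skill)
        self.q[key] += self.lr * (reward - self.q[key])

router = SkillRouter()
# A semantically similar but useless skill fails at execution time,
# while the behaviorally relevant one succeeds; the router adapts.
router.update("refund", "password_reset", reward=0.0)
router.update("refund", "process_refund", reward=1.0)
print(router.select("refund", ["password_reset", "process_refund"]))
# -> process_refund
```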
To prevent regressions in a production environment, automatic skill mutations are protected by an automated unit-test gate. The system creates a synthetic test case, runs it through the updated skill, and checks the results before saving the changes to the global library.
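A minimal sketch of such a gate, under the assumption that skills are small executable snippets: a candidate rewrite is committed to the shared library only if it passes a synthetic test case. The function name, library layout, and `eval`-based execution are illustrative, not the paper's mechanism.

```python
def gate_and_commit(library: dict, name: str, new_code: str,
                    test_input, expected) -> bool:
    """Run the candidate skill on a synthetic case; commit only on pass."""
    try:
        result = eval(new_code, {}, {"task": test_input})
    except Exception:
        return False                  # crashing mutations never land
    if result != expected:
        return False                  # regressions are rejected
    library[name] = new_code          # checks passed: save globally
    return True

library = {"normalize_price": "float(task)"}
# A proposed rewrite that mishandles currency symbols is rejected...
assert not gate_and_commit(library, "normalize_price",
                           "float(task)", "$19.99", 19.99)
# ...while one that strips the symbol passes the gate and is committed.
assert gate_and_commit(library, "normalize_price",
                       "float(task.lstrip('$'))", "$19.99", 19.99)
```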
By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build stronger muscle memory and progressively expand its capabilities end-to-end.
Testing self-evolving agents
The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is the General AI Assistant (GAIA) benchmark, which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. The second is Humanity's Last Exam (HLE), an expert-level benchmark covering eight diverse academic subjects such as mathematics and biology. The entire system was powered by Gemini 3.1 Flash acting as the underlying frozen language model.
The system was compared to a read-write baseline that retrieves skills and collects feedback but lacks the self-evolving capability. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.
The results showed that actively self-evolving memory significantly outperformed a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test-set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed large-scale cross-task skill reuse, the system more than doubled the baseline’s performance, increasing from 17.9% to 38.7%.
Furthermore, Memento-Skills’ dedicated skill router avoids the classic retrieval trap where an irrelevant skill is selected due to semantic similarity. Experiments show that Memento-Skills increases the end-to-end task success rate to 80%, compared to only 50% for standard BM25 retrieval.
The researchers observed that Memento-Skills achieves this performance through organic, structured skill growth. Both benchmark experiments started with only five atomic seed skills, such as basic web searching and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed set into a compact library of 41 skills to handle diverse tasks. On the expert-level HLE benchmark, the system dynamically expanded its library to 235 specialized skills.
Finding the right fit for the enterprise
The researchers have released the code for Memento-Skills on GitHub, and it is readily available for use.
For enterprise architects, the effectiveness of this system depends on domain alignment. Beyond benchmark scores, the main business tradeoff lies in whether your agents are handling discrete, one-off tasks or structured, recurring workflows.
"Skill transfer depends on the degree of similarity between tasks," Wang said. "First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction." In such a dispersed environment, cross-task transfer is limited. "Second, when tasks share enough structure, previously acquired skills can be reused directly. Here, learning becomes more efficient as knowledge is transferred across tasks, allowing the agent to perform well on new problems with little or no additional interaction."
Given that the system requires recurring work patterns to consolidate knowledge, enterprise leaders need to know where to deploy it today – and where to hold off.
"Workflows are probably the most appropriate setting for this approach, as they provide a structured environment in which skills can be created, evaluated and improved," Wang said.
However, he cautioned against over-deployment in areas that do not yet fit the framework. "The physical agents in this context are largely unknown and require further investigation. Furthermore, tasks with longer horizons may require more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and continuous execution over extended sequences of decisions."
As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs basic security rails like automated unit-test gates, enterprise adoption will require a broader framework.
"To enable reliable self-improvement, we need a well-designed evaluation or judging system that can assess performance and provide consistent guidance," Wang said. "Rather than allowing unrestricted self-modification, the process should be structured as a guided form of self-development, where feedback leads the agent toward better designs."