MemRL outperforms RAG on complex agent benchmarks without fine-tuning

A new technique developed by researchers at Shanghai Jiao Tong University and other institutions enables large language model agents to learn new skills without the need for expensive fine-tuning.

The researchers propose MemRL, a framework that gives agents episodic memory: the ability to retrieve past experiences and compose them into solutions for unseen tasks. MemRL allows agents to use environmental feedback to continuously refine their problem-solving strategies.

MemRL is part of a broader research effort to build continual learning capabilities into AI applications. In experiments on major industry benchmarks, the framework outperformed baselines such as RAG and other memory-organization techniques, especially in complex environments that require exploration and experimentation. This suggests that MemRL could become a key component for building AI applications that must operate in dynamic real-world settings where requirements and tasks are constantly changing.

The stability-plasticity dilemma

One of the central challenges in deploying agentic applications is adapting the underlying model to new knowledge and tasks after its initial training phase. Current approaches generally fall into two categories: parametric approaches, such as fine-tuning, and non-parametric approaches, such as retrieval-augmented generation (RAG). Both come with significant trade-offs.

Fine-tuning, while effective for instilling new knowledge, is computationally expensive and slow. More seriously, it often leads to catastrophic forgetting, a phenomenon in which newly acquired knowledge overwrites previously learned information and degrades the model's general performance.

In contrast, non-parametric methods such as RAG are fundamentally similarity-driven: they retrieve information based only on semantic similarity, typically via vector embeddings, without evaluating how useful that information actually is for the query at hand. This approach assumes that "similar means useful," an assumption that often breaks down in complex reasoning tasks.

The researchers argue that human intelligence solves this problem by maintaining a "delicate balance between the stability of cognitive reasoning and the plasticity of episodic memory." In the human brain, stable reasoning (associated with the cortex) is separated from dynamic episodic memory. This allows humans to adapt to new tasks without "rewiring neural circuitry" (roughly the equivalent of fine-tuning a model).

Inside the MemRL framework

Inspired by the way humans combine episodic memory with cognitive reasoning, MemRL is designed to let an agent continuously improve its performance after deployment without compromising the stability of its backbone LLM. Instead of changing the model's parameters, the framework shifts optimization to an external, self-evolving memory structure.

In this architecture, the LLM's parameters remain completely frozen. The model effectively acts as the "cortex": it is responsible for general logic, reasoning, and code generation, but not for storing the specific successes and failures that occur after deployment. This separation keeps cognitive reasoning stable and prevents catastrophic forgetting.

To handle adaptation, MemRL maintains a dynamic episodic memory component. Instead of storing plain-text documents and static embeddings, as is common in RAG systems, MemRL organizes memory as "intention-experience-utility" triples. Each triple contains the user's query (intention), the specific solution trajectory or actions taken (experience), and a score, known as a Q-value, that reflects how successful that experience has been in the past (utility).
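As a rough illustration (the field names and types below are illustrative assumptions, not the paper's actual schema), such a triple can be modeled as a simple record:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTriple:
    """One episodic memory entry: intention, experience, utility."""
    intention: str   # the user query or task the agent faced
    experience: str  # the solution trajectory or actions the agent took
    embedding: list[float] = field(default_factory=list)  # vector used for semantic search
    q_value: float = 0.0  # utility: running estimate of how well this experience worked
```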

Importantly for enterprise architects, this data structure does not require ripping out existing infrastructure. "MemRL is designed as a 'drop-in' replacement for the retrieval layer in existing technology stacks and is compatible with a variety of vector databases," Muning Wen, co-author of the paper and a PhD candidate at Shanghai Jiao Tong University, told VentureBeat. "The existence and updating of the Q-value is purely for better evaluation and management of dynamic data… and is independent of the storage format."

This utility score is the main difference from a classic RAG system. At inference time, MemRL agents employ a two-stage retrieval mechanism. First, the system identifies memories that are semantically close to the query to ensure relevance. It then re-ranks these candidates by their Q-values, prioritizing strategies that have proven effective.
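A minimal sketch of that two-stage idea, building on the MemoryTriple record above (cosine similarity and the cut-off parameters are assumptions; MemRL's actual scoring may differ):

```python
import numpy as np

def two_stage_retrieval(query_emb, memory_bank, k_semantic=20, k_final=3):
    """Stage 1: filter by semantic similarity; stage 2: re-rank by utility."""
    def cosine(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Stage 1: keep the memories whose stored intentions are closest to the query.
    by_similarity = sorted(
        memory_bank, key=lambda m: cosine(query_emb, m.embedding), reverse=True
    )
    candidates = by_similarity[:k_semantic]

    # Stage 2: among the relevant candidates, prefer strategies that worked before.
    candidates.sort(key=lambda m: m.q_value, reverse=True)
    return candidates[:k_final]
```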

The framework incorporates reinforcement learning directly into the memory retrieval process. When an agent attempts a solution and receives environmental feedback (i.e., success or failure), it updates the Q-values of the retrieved memories. This creates a closed feedback loop: over time, the agent learns to ignore distracting memories and prioritize high-value strategies without retraining the underlying LLM.
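One way such a feedback loop could look, using a standard running-average update (the learning rate and exact update rule here are assumptions, not the paper's formula):

```python
def update_utilities(retrieved, reward, alpha=0.1):
    """Nudge each retrieved memory's Q-value toward the observed outcome.

    reward: e.g., 1.0 for task success, 0.0 for failure.
    """
    for memory in retrieved:
        memory.q_value += alpha * (reward - memory.q_value)
```

Memories that repeatedly lead to success accumulate higher Q-values and surface earlier in stage-two re-ranking, while memories that mislead the agent sink over time.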

While adding a reinforcement learning step may sound like a source of significant latency, Wen said the computational overhead is minimal. "Our Q-value calculations are performed entirely on the CPU," he said.

MemRL also supports continual learning at runtime. When the agent encounters a new scenario, the system uses the frozen LLM to summarize the new trajectory and adds it to the memory bank as a new triple. This lets the agent dynamically expand its knowledge base as it interacts with the world.
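Sketched in code, again using the MemoryTriple record above and placeholder callables rather than MemRL's real API (`summarize` stands in for a call to the frozen LLM, `embed` for an embedding model):

```python
def add_experience(memory_bank, query, trajectory, summarize, embed):
    """Distill a fresh trajectory into a new triple and store it."""
    memory_bank.append(MemoryTriple(
        intention=query,
        experience=summarize(trajectory),  # frozen LLM condenses the trajectory
        embedding=embed(query),            # indexed for stage-one semantic search
        q_value=0.0,                       # neutral prior, refined by later feedback
    ))
```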

It's worth noting that automating value assignment comes with a risk: if the system mistakenly reinforces bad interactions, the agent may learn the wrong lessons. Wen acknowledged this risk of "memory poisoning" but noted that, unlike black-box neural networks, MemRL remains transparent and auditable. "If a bad conversation is mistakenly classified as a positive example… it can spread more widely," Wen said. "However… we can easily fix this by deleting the corrupted data from the memory banks or resetting their Q-values."

MemRL in action

The researchers evaluated MemRL against several baselines on four diverse industry benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), LifelongAgentBench (OS and database interactions), and Humanity's Last Exam (complex multidisciplinary reasoning).

The results showed that MemRL consistently outperformed the baselines in both runtime learning (improving within a session) and transfer learning (generalizing to unseen tasks).

The advantages of this value-aware retrieval mechanism were most evident in exploration-heavy environments like ALFWorld. In this benchmark, which requires agents to navigate and interact with a simulated home environment, MemRL achieved a relative improvement of about 56% over Memp, another agentic memory framework. The researchers found that the reinforcement learning component effectively encouraged the agent to explore and find solutions to complex tasks that similarity-based retrieval often failed to solve.

When the memory bank was frozen and tested on a held-out set to measure generalization, MemRL achieved the highest accuracy across all benchmarks. On LifelongAgentBench, for example, it improved significantly over the standard RAG baseline on OS tasks. This indicates that the system does not simply memorize training data; it filters out low-value memories and retains high-utility experiences that generalize to new situations.

The broader picture for self-evolving agents

MemRL fits within a growing body of research on memory-based Markov decision processes (M-MDPs), a formulation that frames memory retrieval as an active decision-making step rather than a passive search function. By treating retrieval as a policy that can be optimized through reinforcement learning, frameworks such as MemRL and similar approaches like Memento are paving the way for more autonomous systems.
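Informally, and with notation that is an assumption here rather than taken from the papers, the idea can be written as a policy that first selects a memory, then acts conditioned on it, with retrieval utility updated from the reward:

```latex
% Illustrative notation only; not the papers' exact formulation.
\[
m \sim \pi_{\mathrm{retrieve}}(\cdot \mid s, \mathcal{M}), \qquad
a \sim \pi_{\mathrm{act}}(\cdot \mid s, m), \qquad
Q(s, m) \leftarrow Q(s, m) + \alpha \big[ r - Q(s, m) \big]
\]
```

Here $s$ is the task state, $\mathcal{M}$ the memory bank, and $r$ the environmental reward; only the retrieval side changes as the agent learns, while the acting policy (the frozen LLM) stays fixed.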

For enterprise AI, this shift is significant. It suggests a future where agents can be deployed with general-purpose LLMs and then rapidly specialized to company-specific workflows, proprietary databases, and unique problem sets through interaction alone. The main change is toward frameworks that treat applications as dynamic environments the agent can learn from.

These emerging capabilities could allow organizations to maintain consistent, high-performing agents that evolve with their business needs, addressing the problem of stale models without incurring the prohibitive cost of constant retraining.

This marks a change in how we value data. "In a future where static data is about to disappear, the interaction experience generated by each intelligent agent during its lifetime will become the new fuel," Wen said.


