
Long-horizon reasoning exposes a core weakness in AI agents: context windows fill rapidly, and retrieval pipelines return noise rather than signal.
To solve this, researchers at the National University of Singapore developed MRAgent, a framework that bypasses the static "retrieve-then-reason" approach. Instead, it uses a mechanism that lets the agent dynamically reconstruct its memory as evidence accumulates.
This multi-step memory reconstruction is integrated into the reasoning process of large language models (LLMs). Although not the only framework in this area, MRAgent significantly reduces token consumption and runtime costs compared to other agentic memory management approaches.
Limitations of passive retrieval in long-horizon tasks
In classic retrieval pipelines, documents are retrieved via vector search or graph traversal and passed to the LLM for reasoning. This passive approach fails because it cannot couple reasoning with memory access, creating three major bottlenecks:
- These systems cannot modify their retrieval strategy mid-reasoning. If an agent receives a document and detects an important missing clue, such as a specific date or person, it has no way to issue a new query based on that finding.
- Fixed similarity scores and predefined graph expansions return surface-level matches that fill the LLM’s context window with irrelevant noise, degrading reasoning.
- Current systems rely heavily on pre-built constructs such as top-k results and static relevance functions, which limit the flexibility needed to scale to unpredictable, long-horizon user interactions.
The researchers argue that to overcome these limitations, developers should move toward an “active and associative reconstruction process”, a concept inspired by cognitive neuroscience.
Under this paradigm, memory recall unfolds step by step rather than serving as a passive read-out of a static database. The system starts with small, specific cues taken from the user's prompt, such as a person’s name, an action, or a location. These initial cues point to connecting concepts or categories rather than huge blocks of text.
By following these associative links, the agent collects small pieces of evidence one by one. It uses each new piece of information to guide its next step until it has pieced together a complete, accurate story.
How does MRAgent implement active memory reconstruction?
Instead of viewing memory as a static database, MRAgent (Memory Reasoning Architecture for LLM Agents) treats it as an interactive environment. When processing a complex query, the agent uses the reasoning capabilities of the backbone LLM to explore multiple candidate retrieval paths in a structured memory graph.
At each step, the LLM evaluates the intermediate evidence it has collected and uses it to refine its search. It infers new search constraints, follows the most informative paths, and prunes irrelevant branches. This allows MRAgent to piece together deeply buried information without flooding the LLM's context with noise.
To make this active exploration computationally efficient and scalable, the framework organizes its database using a “Cue-Tag-Content” mechanism. It functions as a multilevel associative graph with three node types:
- Cue: Fine-grained keywords, such as entities or contextual features extracted from user interactions.
- Content: The actual stored memory units. These are divided into multi-grained layers, such as episodic memory for concrete events and semantic memory for stable facts and user preferences.
- Tag: Semantic bridges that summarize the relationships between specific cues and content.
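The three node types can be pictured as a small in-memory graph. The sketch below is illustrative only: the class names, fields, and the `link` helper are assumptions made for this example, not MRAgent's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    """A stored memory unit (hypothetical fields)."""
    text: str
    layer: str  # "episodic" or "semantic"

@dataclass
class Tag:
    """A short summary that bridges cues and content."""
    summary: str
    content_ids: list[int] = field(default_factory=list)

@dataclass
class MemoryGraph:
    contents: list[Content] = field(default_factory=list)
    tags: list[Tag] = field(default_factory=list)
    cue_to_tags: dict[str, set[int]] = field(default_factory=dict)

    def link(self, cues, tag_summary, text, layer="episodic"):
        """Store one memory and connect it to a tag and to its cues."""
        self.contents.append(Content(text, layer))
        content_id = len(self.contents) - 1
        # Reuse an existing tag with the same summary, otherwise create one.
        for tag_id, tag in enumerate(self.tags):
            if tag.summary == tag_summary:
                break
        else:
            self.tags.append(Tag(tag_summary))
            tag_id = len(self.tags) - 1
        self.tags[tag_id].content_ids.append(content_id)
        for cue in cues:
            self.cue_to_tags.setdefault(cue.lower(), set()).add(tag_id)
        return content_id

graph = MemoryGraph()
graph.link(["Nate", "video game tournament"], "tournament victory",
           "Nate won his third tournament.")
graph.link(["Nate"], "tournament victory", "Nate won his first tournament.")
graph.link(["Nate"], "user preference", "Nate prefers strategy games.",
           layer="semantic")
```

Keeping cues and tags as lightweight index entries means the agent can inspect them cheaply, while the heavier `Content` text is only touched once a tag has been chosen.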
This structure enables a highly efficient two-step retrieval process. The LLM first moves from cues to candidate tags. Because tags explicitly summarize the semantic relationships and structural associations in the data, the agent can judge relevance from these summaries alone. The LLM identifies promising traversal paths and discards irrelevant branches before spending computation and prompt tokens on the detailed, heavy memory content.
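The two-step process can be sketched in a few lines. Everything here is a toy: the index contents are made up, and `tag_is_relevant` is a crude keyword check standing in for the LLM's judgment, which the real system would obtain from the backbone model.

```python
# Hypothetical toy index: cue -> tag summaries, tag summary -> stored memories.
CUE_TO_TAGS = {
    "nate": ["tournament victory", "tournament participation"],
}
TAG_TO_CONTENT = {
    "tournament victory": [
        "Nate won his third tournament and saved the prize money.",
    ],
    "tournament participation": [
        "Nate signed up for a regional tournament.",
        "Nate practiced for the qualifiers.",
    ],
}

def tag_is_relevant(tag: str, query: str) -> bool:
    """Stand-in for the LLM's relevance judgment (keyword check, an assumption)."""
    wanted = {"won": "victory", "win": "victory"}
    return any(word in query.lower() and hint in tag
               for word, hint in wanted.items())

def two_step_retrieve(cues, query):
    # Step 1: move from cues to candidate tags; only short summaries are read.
    candidates = []
    for cue in cues:
        for tag in CUE_TO_TAGS.get(cue.lower(), []):
            if tag not in candidates:
                candidates.append(tag)
    kept = [t for t in candidates if tag_is_relevant(t, query)]
    # Step 2: fetch the heavier content only for tags that survived pruning.
    memories = [m for t in kept for m in TAG_TO_CONTENT[t]]
    return kept, memories

kept_tags, memories = two_step_retrieve(
    ["Nate"], "How did Nate use the prize money he won?"
)
```

In this toy run the participation tag is dropped at step 1, so its two memories are never loaded into the prompt.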
For example, a user might ask an AI agent, "When Nate won his third video game tournament, how did he use the prize money?"
- MRAgent extracts fine-grained initial cues from the prompt, e.g. "Nate," "video game tournament," and "win."
- The agent maps these initial cues onto the memory graph and looks at the tags associated with them. It sees tags such as "tournament victory" and "tournament participation." Since the query only concerns what Nate did after winning, MRAgent prunes the participation tag and follows the victory tag.
- The agent retrieves the episodic content linked to the chosen Cue-Tag pair, which yields three memory episodes in which Nate won a tournament.
- MRAgent examines the three memories, decides that one of them is relevant to the query, and discards the other two.
- With this information, it updates its cues and begins another round of search and pruning. From the newly acquired episodic memory, the agent adds “tournament income” to its cues and uses it to cross-reference new tags and home in on new memories. It repeats this process until it has collected enough information to answer the question, which might be something like “Nate saved the money.”
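The walkthrough above can be mimicked with a scripted loop. This is a minimal sketch under heavy assumptions: the `MEMORY` contents are invented, `relevant` is a keyword match standing in for the LLM, and the per-round decisions (which terms to keep, which cue to add, when to stop) are hard-coded where the real agent would reason them out.

```python
# Hypothetical toy memory: tag -> (cues that reach it, episodes stored under it).
MEMORY = {
    "tournament victory": ({"nate", "win"}, [
        "Nate won his first tournament in March.",
        "Nate won his third tournament and received tournament income.",
    ]),
    "tournament participation": ({"nate"}, ["Nate entered a local qualifier."]),
    "spending and saving": ({"tournament income"}, ["Nate saved the money."]),
}

def relevant(text, focus_terms):
    """Stand-in for the LLM's relevance judgment (keyword match, an assumption)."""
    return any(term in text.lower() for term in focus_terms)

def reconstruct(cues, rounds):
    """Run one prune-and-retrieve round per entry in `rounds`.

    Each entry is (tag_terms, episode_terms, new_cue_or_None). In the real
    system these decisions would come from the LLM; here they are scripted.
    """
    cues, evidence = set(cues), []
    for tag_terms, episode_terms, new_cue in rounds:
        # Tags reachable from the current cues, then pruned by relevance.
        tags = [t for t, (tag_cues, _) in MEMORY.items() if tag_cues & cues]
        tags = [t for t in tags if relevant(t, tag_terms)]
        # Content is read only for the surviving tags.
        episodes = [e for t in tags for e in MEMORY[t][1]]
        evidence += [e for e in episodes
                     if relevant(e, episode_terms) and e not in evidence]
        if new_cue is None:   # enough evidence gathered: stop searching
            break
        cues.add(new_cue)     # refine the cues with what was just learned
    return evidence

evidence = reconstruct(
    {"nate", "win"},
    rounds=[
        (["victory"], ["third"], "tournament income"),
        (["saving"], ["saved"], None),
    ],
)
```

Note that the "spending and saving" tag is unreachable in round one; it only becomes visible after the new cue is added, which is the point of reconstructing memory iteratively.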
MRAgent’s performance on industry benchmarks
MRAgent is one of several frameworks addressing agentic memory. Alternatives include A-MEM, a graph-based agentic memory framework, and MemoryOS, a hierarchical memory framework. Other persistent memory frameworks include LangMem and Mem0.
The researchers tested MRAgent on the LoCoMo and LongMemEval benchmarks. These test an agent's ability to answer questions about long-running tasks and conversations that span dozens of sessions and hundreds of dialogue turns. The backbone models used were Gemini 2.5 Flash and Claude Sonnet 4.5. The system was compared against standard RAG, A-MEM, MemoryOS, LangMem and Mem0.
MRAgent consistently outperformed every baseline by a significant margin on both models and across all query types.
However, for enterprise developers, the most important metric is often computational cost. In the LongMemEval tests, MRAgent reduced prompt token consumption to only 118k per sample. By comparison, A-MEM consumed 632k tokens, and LangMem burned 3.26 million tokens per query. MRAgent also roughly halved the runtime compared to A-MEM, from 1,122 seconds to 586 seconds.
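To put the reported figures in proportion, the ratios can be computed directly. This is plain arithmetic on the numbers quoted above, assuming the token counts for all three systems are measured per query on the same benchmark.

```python
# Figures reported for LongMemEval in the article.
tokens = {"MRAgent": 118_000, "A-MEM": 632_000, "LangMem": 3_260_000}
runtime_s = {"MRAgent": 586, "A-MEM": 1_122}

for name in ("A-MEM", "LangMem"):
    ratio = tokens[name] / tokens["MRAgent"]
    print(f"{name} uses {ratio:.1f}x the tokens of MRAgent")

runtime_saving = 1 - runtime_s["MRAgent"] / runtime_s["A-MEM"]
print(f"Runtime reduction vs. A-MEM: {runtime_saving:.0%}")
```

This works out to about 5.4x fewer tokens than A-MEM, about 27.6x fewer than LangMem, and a runtime reduction of roughly 48%.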
What makes MRAgent practical is its on-demand behavior. Evaluating tags and pruning irrelevant paths before retrieval saves money and context space. Furthermore, the system autonomously evaluates its accumulated context and knows when to stop searching, avoiding unnecessary exploration.
Implementation and deployment caveats
While MRAgent is highly effective, it requires the Cue-Tag-Content structure to be prepared before the agent can answer a query. Developers must figure out how to architect the underlying memory database so that the LLM can efficiently navigate associative links and prune irrelevant paths without computation costs exploding.
Fortunately, developers don’t need to manually label or structure this data. The authors designed MRAgent with an automated distillation pipeline that uses an LLM to process the raw interaction history and automatically populate the memory graph. For a developer, the job is to implement and orchestrate this automated ingestion pipeline rather than manually tagging data.
You need to set up a background task or streaming pipeline that passes raw user interactions through prompt templates to extract this metadata before storing it in your graph database.
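Such an ingestion step might look roughly like the sketch below. It is not the authors' pipeline: the template wording, the JSON shape, and the `call_llm` and `store` parameters are all assumptions, and a canned `fake_llm` replaces the model so the example runs offline.

```python
import json

# Hypothetical prompt template; the actual templates ship with the authors' code.
EXTRACTION_TEMPLATE = (
    "Extract memory metadata from the interaction below.\n"
    'Reply with JSON: {{"cues": [...], "tag": "...", "content": "..."}}\n\n'
    "Interaction:\n{interaction}"
)

def ingest(interaction, call_llm, store):
    """Fill the template, ask the LLM for metadata, and persist the result.

    `call_llm` (prompt -> JSON string) and `store` (a list standing in for a
    graph database) are placeholders supplied by the caller.
    """
    prompt = EXTRACTION_TEMPLATE.format(interaction=interaction)
    record = json.loads(call_llm(prompt))
    # Reject malformed extractions before they reach the graph.
    if not {"cues", "tag", "content"} <= set(record):
        raise ValueError(f"incomplete extraction: {sorted(record)}")
    store.append(record)
    return record

def fake_llm(prompt):
    """Canned response so the sketch runs without a model or network access."""
    return json.dumps({
        "cues": ["Nate", "video game tournament"],
        "tag": "tournament victory",
        "content": "Nate won his third video game tournament.",
    })

graph_store = []
record = ingest("Nate: I just won my third tournament!", fake_llm, graph_store)
```

In a real deployment, `ingest` would be called from a background worker or stream consumer, and `store.append` would be replaced by writes to the graph database of your choice.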
However, the authors emphasize that this is a lightweight construction step and MRAgent intentionally keeps ingestion simple.
The authors have released the code on GitHub.