GAM takes aim at “context rot”: A dual-agent memory architecture that outperforms long-context LLMs

For all their superhuman capabilities, today’s AI models suffer from a surprisingly human flaw: they forget. Give an AI assistant a long conversation, a multi-step reasoning task, or a project spanning several days, and eventually it will lose the thread. Engineers call this phenomenon “context rot,” and it has quietly become one of the most significant hurdles to building AI agents that can function reliably in the real world.

A research team from mainland China and Hong Kong believes it has a solution to context rot. Their new paper introduces General Agentic Memory (GAM), a system designed to preserve long-horizon information without modifying the underlying model. The basic premise is simple: split memory into two specialized roles, one that captures everything and one that retrieves exactly the right things at the right time.

The early results are encouraging, and the timing couldn’t be better: as the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM arrives at just the right moment.

When large context windows are still not enough

At the core of every large language model (LLM) is a hard boundary: a fixed “working memory,” commonly known as a context window. Once a conversation grows too long, older information is truncated, summarized, or quietly discarded. AI researchers have long recognized this limitation, and since early 2023 developers have raced to expand the context window, dramatically increasing the amount of information a model can handle at once.
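The mechanics are easy to see in miniature. The sketch below (my own toy illustration, not any vendor’s actual code) shows how a fixed token budget forces the oldest turns out of the window as new ones arrive, which is exactly how early details quietly disappear:

```python
# Toy illustration of a fixed context window: once the token budget is
# exceeded, the oldest turns are silently dropped.
def fit_to_window(turns, max_tokens, count_tokens=lambda t: len(t.split())):
    kept, used = [], 0
    # Walk backwards from the newest turn, keeping as much as fits.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

history = [
    "user: my API key is in vault slot 7",
    "assistant: noted",
    "user: write a long report " + "word " * 20,
]
# With a 25-token budget, only the newest turn survives; the earliest
# turn, which holds the key detail, is gone.
print(fit_to_window(history, max_tokens=25))
```

Real systems count tokens with a proper tokenizer rather than whitespace splitting, but the failure mode is the same: whatever falls outside the budget simply ceases to exist for the model.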

Mistral’s Mixtral 8x7B offered a 32K-token window, roughly 24,000 to 25,000 English words. MosaicML’s MPT-7B-StoryWriter-65k+ more than doubled that capacity; Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 followed with 128K and 200K windows respectively, both expandable to an unprecedented one million tokens. Even Microsoft joined the race, moving from the earlier Phi models’ 2K-token limit to Phi-3’s 128K context window.

Increasing the context window may seem like the obvious fix, but it is not. Even models with huge 100K-token windows, large enough to hold hundreds of pages of text, still struggle to recall details buried at the beginning of a long conversation. Scaling context also brings its own problems: as prompts grow longer, models become less reliable at locating and interpreting information, because attention to distant tokens weakens and accuracy gradually degrades.

Long inputs also degrade the signal-to-noise ratio; stuffing in every possible detail can actually produce worse responses than a focused prompt. Long prompts also slow the model down: more input tokens mean significantly higher output-token latency, creating a practical limit on how much context can be used before performance suffers.

Memory comes at a price

For most organizations, supersized context windows come with an obvious downside: they’re expensive. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, a single bloated request can drive up costs. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context. And this is the tension at the heart of the issue: memory is essential to making AI more capable, yet the naive way of providing it doesn’t scale.

As context windows stretch to hundreds of thousands or even millions of tokens, the financial overhead grows steeply. Scaling context is as much an economic challenge as a technical one, and relying on ever-larger windows becomes an untenable strategy for long-term memory.

Workarounds like summarization and retrieval-augmented generation (RAG) are not a complete fix either. Summaries inevitably strip out subtle but important details, and traditional RAG, while robust on static documents, breaks down when information spans multiple sessions or evolves over time. Even newer variants such as agentic RAG and RAG 2.0, which handle the retrieval process more capably, inherit the same fundamental flaw: they treat retrieval as the solution rather than memory as the core problem.

Compilers solved this problem decades ago

If memory is the real bottleneck and retrieval alone can’t fix it, the field needs a different kind of solution. That is the bet behind GAM. Rather than compressing history up front, GAM keeps a complete, lossless record and layers smart, on-demand recall on top of it, reproducing the exact details an agent needs even as the conversation twists and evolves. A useful way to understand GAM is through a familiar idea from software engineering: just-in-time (JIT) compilation. Instead of pre-computing a rigid, highly compressed memory, GAM stays light and agile by storing a minimal set of cues alongside a complete, untouched archive of raw history. Then, when a request comes in, it “compiles” a tailored context on the fly.

This JIT approach is baked into GAM’s dual-agent architecture, which lets the AI carry context through lengthy conversations without straining its window or guessing in advance what will matter. The result is accurate information, delivered at exactly the right time.
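The contrast with eager compression can be sketched in a few lines. This is my own illustrative toy, not the paper’s implementation; the point is only that a summary written before the question is known loses detail, while a context compiled at query time can still reach it:

```python
# Toy contrast between eager compression and JIT context compilation.
# All names and strings are illustrative only.
history = [
    "user: the staging db password is 'tulip-42'",
    "assistant: stored securely",
    "user: let's discuss the roadmap",
]

# Eager path: a summary written before any question is known.
eager_summary = "User shared a credential; then roadmap discussion."

def jit_compile(query, raw):
    # JIT path: assemble context from the lossless raw history,
    # driven by the actual query's terms.
    terms = set(query.lower().split())
    return [turn for turn in raw if terms & set(turn.lower().split())]

ctx = jit_compile("what is the staging db password?", history)
# The exact credential survives in the JIT-compiled context...
assert "tulip-42" in " ".join(ctx)
# ...but was already gone from the eager summary.
assert "tulip-42" not in eager_summary
```

A production system would use real retrieval rather than word overlap, but the asymmetry is the point: lossy compression is irreversible, while a lossless archive lets recall be deferred until the question exists.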

Inside the GAM: A two-agent system built for memory that persists

GAM revolves around a simple idea: separate the act of memorizing from the act of remembering. Fittingly, it involves two components: the “memorizer” and the “researcher.”

Memorizer: perfect recall without overload

The memorizer captures every exchange in its entirety, silently distilling each interaction into a concise memo while preserving the full, unaltered session in a searchable page store. It doesn’t aggressively compress or guess what’s important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval, and generates optional lightweight summaries for quick scanning. Crucially, every detail is preserved and nothing is thrown away.
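A minimal sketch of that page store might look like the following. The class and field names are my own assumptions for illustration, not identifiers from the paper; the behavior it models is the one described above — verbatim content, metadata, and a cheap optional memo:

```python
# Illustrative sketch of a memorizer-style page store: each exchange is
# kept verbatim as a structured "page" with metadata and a lightweight
# summary. Nothing is discarded or lossily compressed.
from dataclasses import dataclass, field
import itertools
import time

_page_ids = itertools.count(1)

@dataclass
class Page:
    page_id: int
    content: str                          # full, unaltered exchange
    summary: str                          # quick-scan memo, not a replacement
    metadata: dict = field(default_factory=dict)

class Memorizer:
    def __init__(self):
        self.pages = {}

    def memorize(self, exchange: str, session: str) -> int:
        page = Page(
            page_id=next(_page_ids),
            content=exchange,                        # lossless
            summary=exchange.split(".")[0][:80],     # cheap first-sentence memo
            metadata={"session": session, "ts": time.time()},
        )
        self.pages[page.page_id] = page
        return page.page_id
```

The key design choice is that the summary is an index into the content, never a substitute for it: the researcher can scan memos quickly, then drop down to the exact raw text when a detail matters.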

Researcher: a deep retrieval engine

When the agent needs to act, the researcher takes the lead: it plans a search strategy, combines embeddings with keyword methods like BM25, navigates by page IDs, and pieces the results together. It runs layered searches over the page store, blending vector retrieval, keyword matching, and direct lookups. It evaluates the findings, identifies gaps, and keeps searching until it has enough evidence to give a reliable answer, much like a human analyst reviewing old notes and primary documents. It iterates, explores, integrates, and reflects until it has produced a clean, task-specific briefing.
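The hybrid-scoring idea can be sketched briefly. This is a toy stand-in, not the paper’s code: a word-overlap score substitutes for BM25, and a bag-of-words cosine substitutes for embedding similarity, but the blending pattern is the same:

```python
# Hedged sketch of hybrid retrieval: blend a keyword score (stand-in for
# BM25) with a vector similarity (bag-of-words cosine standing in for
# embeddings), then take the top-k pages.
import math
from collections import Counter

def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def cosine_score(query: str, doc: str) -> float:
    qv, dv = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(qv[w] * dv[w] for w in qv)
    norm = (math.sqrt(sum(v * v for v in qv.values()))
            * math.sqrt(sum(v * v for v in dv.values())))
    return dot / norm if norm else 0.0

def search(query: str, pages: list[str], k: int = 2, alpha: float = 0.5) -> list[str]:
    # alpha weights keyword evidence against vector evidence.
    ranked = sorted(
        pages,
        key=lambda p: alpha * keyword_score(query, p)
                      + (1 - alpha) * cosine_score(query, p),
        reverse=True,
    )
    return ranked[:k]

pages = [
    "the deploy key rotates every friday",
    "lunch menu for friday",
    "quarterly revenue numbers",
]
print(search("when does the deploy key rotate", pages, k=1))
# Prints the deploy-key page first.
```

What the sketch leaves out is the iterative loop the article describes: a real researcher agent would inspect these results, judge whether the evidence suffices, reformulate the query, and search again until it can assemble a complete briefing.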

GAM’s power comes from this JIT memory pipeline, which assembles rich, task-specific context on demand rather than relying on brittle, pre-computed summaries. Its core innovation is simple yet powerful: keep all information intact and make every detail recoverable.

Ablation studies support this view: the memorizer alone is not enough, and naive retrieval is not enough either. It is the pairing of a lossless archive with an active, iterative research engine that lets GAM surface details other systems leave behind.

Outperforming RAG and long-context models

To test GAM, the researchers pitted it against standard RAG pipelines and long-context models such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM on four major long-context and memory-intensive benchmarks, each chosen to probe a different aspect of the system’s capabilities:

  • LoCoMo: Measures an agent’s ability to retain and recall information across long, multi-session interactions, including single-hop, multi-hop, temporal-reasoning, and open-domain tasks.

  • HotpotQA: A widely used multi-hop QA benchmark derived from Wikipedia, adapted here via MemAgent’s memory-stress-test version, which mixes relevant documents with distractors to create contexts of 56K, 224K, and 448K tokens; ideal for testing how well GAM handles noisy, enormous inputs.

  • RULER: Evaluates retrieval accuracy, multi-hop state tracking, aggregation, and QA performance on long sequences under a 128K-token context, probing long-horizon reasoning.

  • NarrativeQA: A benchmark in which each question must be answered from the full text of a book or movie script; the researchers sampled 300 instances with an average context size of 87K tokens.

Together, these datasets and benchmarks allowed the team to assess both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.

GAM came out ahead on every benchmark. Its biggest win was on RULER, the standard for long-range state tracking. Notably:

  • GAM exceeded 90% accuracy.

  • RAG collapsed because key details were lost during summarization.

  • Long-context models faltered because old information effectively “faded” even when technically present.

Clearly, larger context windows are not the answer. GAM works because it retrieves exactly the tokens it needs rather than accumulating them.

GAM, context engineering and competitive approaches

Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the correct information can always be retrieved, even across long time horizons. The technique’s emergence coincides with a broader shift in AI toward context engineering: the practice of shaping everything an AI model sees — its instructions, history, retrieved documents, tools, preferences, and output formats.

Context engineering has increasingly brought memory engineering to the fore, and other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed a “semantic operating system” built on lifelong adaptive memory.

GAM’s philosophy, however, is distinct: lose nothing, retrieve intelligently. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime. For agents handling multi-day projects, ongoing workflows, or long-term relationships, that reliability may prove essential.

Why does GAM matter in the long run?

Just as adding more compute does not automatically yield better algorithms, expanding the context window alone will not solve AI’s long-term memory problem. Meaningful progress requires rethinking the underlying system, and that is GAM’s approach. Rather than leaning on bigger models, huge context windows, or endlessly sophisticated prompts, it treats memory as an engineering challenge — one that benefits from structure rather than brute force.

As AI agents move from clever demos to mission-critical tools, the ability to remember long histories becomes essential to building trustworthy, intelligent systems. Enterprises need agents that can track evolving tasks, maintain continuity, and recall past interactions with accuracy and precision. GAM offers a practical path toward that future, hinting at what may be the next major frontier in AI: not larger models, but smarter memory systems and the context architectures that make them possible.
