MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot

Recursive language model
Recursive language models (RLMs) are an inference technique developed by researchers at MIT CSAIL that treats a long prompt as an external environment for the model. Instead of loading the entire prompt into the model’s context window, the framework lets the LLM programmatically examine, decompose, and call itself recursively on snippets of the text.

Instead of expanding the context window or summarizing old information, the MIT team reframes long context as a systems problem. By letting models treat prompts as data they can inspect with code, recursive language models allow LLMs to reason over millions of tokens without retraining. The approach offers enterprises a practical path to long-context tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models.

Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that call LLMs directly.

The LLM context problem

While frontier models are becoming increasingly sophisticated in reasoning, their ability to process large amounts of information is not improving at the same rate. This constraint is driven by two different limits: a hard physical cap on how much text a model can process at once (context length), and "context rot," the tendency of model performance to degrade as the context fills up.

The researchers argue that the challenge is whether it is possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens – a challenge Zhang argues cannot be solved simply by expanding the context window.

"There is an entropy argument which means that as you increase the size of the effective context window, you need exponentially more data samples," Alex Zhang, co-author of the paper, told VentureBeat.

Current approaches to extending context often rely on compaction, where the model summarizes older parts of the interaction to free up space. However, this method fails for tasks that require random access to specific details buried in early parts of the prompt.

How do RLMs work?

The concept behind RLMs is derived from "out-of-core" algorithms used in classical computing. These algorithms process datasets too large to fit in a computer’s main memory by keeping the data on disk and fetching only the necessary portions as needed.

RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM initially sees only general metadata about the data (such as the total character count), but does not "look at" the text itself.
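The setup described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the MIT implementation; the function name `load_prompt_as_variable` is hypothetical.

```python
# Illustrative sketch: keep the full prompt in environment memory and
# expose only metadata to the orchestrating model, never the raw text.
def load_prompt_as_variable(prompt: str) -> dict:
    """Return summary statistics the model can see up front."""
    return {
        "total_chars": len(prompt),
        "total_lines": prompt.count("\n") + 1,
    }

# A 3-million-character "prompt" never enters the model's context;
# the model sees only its size.
meta = load_prompt_as_variable("x" * 3_000_000)
```

The key design point is that the variable holding the prompt lives in the environment, so its size is bounded by machine memory rather than by the model's context window.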

Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the variable, using standard commands to peek into the data. For example, the model can use regular expressions to find specific keywords like "Chapter 1" or "financial results."

When code execution surfaces a relevant snippet, the RLM pulls only that specific fragment into its active context window for analysis.
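The peek-and-pull step might look like the following sketch, which searches the stored prompt with a regular expression and returns only short windows of surrounding text. The helper name `find_snippets` and the window size are illustrative assumptions, not part of the MIT codebase.

```python
import re

def find_snippets(prompt: str, pattern: str, window: int = 40) -> list[str]:
    """Return short snippets of `prompt` surrounding each regex match,
    small enough to fit in the model's active context."""
    snippets = []
    for m in re.finditer(pattern, prompt):
        start = max(0, m.start() - window)
        end = min(len(prompt), m.end() + window)
        snippets.append(prompt[start:end])
    return snippets

doc = "front matter ... Chapter 1: revenue grew 12% year over year ..."
hits = find_snippets(doc, r"Chapter 1")
```

Only the matched fragments, not the full document, would then be passed into a model call.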

For example, if the prompt is a huge book, the LLM could write a loop that identifies chapter boundaries and then triggers a subcall to summarize each chapter individually.
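The book example above can be sketched as a split-then-subcall loop. Here `summarize_with_worker_llm` is a hypothetical stand-in for a real recursive model call; in a real RLM it would send each chapter to a cheaper worker model.

```python
import re

def summarize_with_worker_llm(chunk: str) -> str:
    # Placeholder for a recursive subcall to a worker LLM; here we
    # simply return the chapter's first line instead of a summary.
    return chunk.strip().splitlines()[0]

def summarize_book(book: str) -> list[str]:
    """Find chapter boundaries with a regex, then issue one subcall
    per chapter."""
    chapters = [c for c in re.split(r"(?=Chapter \d+)", book) if c.strip()]
    return [summarize_with_worker_llm(c) for c in chapters]

book = "Chapter 1\nAlice falls down.\nChapter 2\nShe meets a cat."
summaries = summarize_book(book)
# summaries == ["Chapter 1", "Chapter 2"]
```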

The architecture typically involves two agents. A "root language model," often a capable, heavyweight model such as GPT-5, acts as the orchestrator: it plans the approach, writes the code, and manages data flow within the REPL environment. A "recursive language model," often a fast, cheap model, serves as the worker. The root LM calls this worker to process the specific text snippets isolated by code.

Because the prompt resides in the environment’s memory rather than in the model’s context window, the system can handle inputs far larger than the model’s training range. Importantly, for the end user, the RLM behaves exactly like a standard model: it accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLM calls.
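The drop-in property can be made concrete with a sketch: to the caller, an RLM has the same signature as a plain LLM call, string in and string out, while chunking happens internally. Both function names here are hypothetical placeholders, not the project's actual API.

```python
def plain_llm_call(prompt: str) -> str:
    # Stand-in for a direct model API call, limited to short prompts.
    return f"answered from {len(prompt)} chars"

def rlm_call(prompt: str, chunk_size: int = 1_000) -> str:
    """Same signature as `plain_llm_call`, but the prompt stays in
    environment memory and is processed chunk by chunk internally."""
    partials = [
        plain_llm_call(prompt[i:i + chunk_size])
        for i in range(0, len(prompt), chunk_size)
    ]
    # A real orchestrator would synthesize the partial answers; we
    # just report how many chunks were processed.
    return f"aggregated {len(partials)} partial answers"

result = rlm_call("x" * 2_500)
# result == "aggregated 3 partial answers"
```

Because the outer signature is unchanged, an application can switch between the two calls without modifying any surrounding code.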

For developers willing to experiment, the RLM code is currently available on GitHub.

"A major argument for RLM is that most complex tasks can be decomposed into smaller, ‘local’ subtasks," Zhang said. "However, how to decompose this context/problem is not trivial, and the model must be able to do this."

RLM in action

To validate the framework, the researchers tested the RLM against the base model and other agentic approaches, such as CodeAct and summarization agents, on various long-context tasks, including retrieval and multi-hop question answering.

The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark with inputs of 6 to 11 million tokens, the standard base model failed completely, scoring 0%. In contrast, an RLM driven by GPT-5 achieved a score of 91.33%, outperforming the Summary Agent (70.47%) and CodeAct (51%).

The framework also excelled on tasks with high computational complexity. On OOLONG-Pairs, an information-dense logic benchmark where difficulty scales quadratically with input length, the base GPT-5 model failed catastrophically with a score of only 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating an emergent capability to handle the dense tasks that cripple standard models. Similarly, on code understanding tasks (the CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, from 24% to 62%.

Regarding the context rot problem, the data showed that while the performance of the base GPT-5 degrades rapidly as task complexity increases, RLM performance remained stable, consistently outperforming the base model on contexts longer than 16,000 tokens.

Despite the more complex workflow, RLMs often maintained comparable or lower average costs than the baseline. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the baseline.

However, the researchers noted that while average costs are low, RLM trajectories are "long-tailed." Outlier runs can be costly if the model gets stuck in loops or performs unnecessary verification. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.

"Today, you probably have to implement your own guardrails and logic to control RLM behavior," Zhang said. However, the researchers predict that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLMs into model training, potentially addressing edge cases where a model's inference budget balloons.

For enterprise architects deciding where to place their bets, the RLM framework provides a new tool for tackling information-dense problems.

"I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs," Zhang said. "I think RLMs work in conjunction with standard retrieval methods like RAG; they are not replacements, and can be used in different settings or together."


