
One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, increases token costs, and makes it difficult to train the entire system as a cohesive unit.
To address this challenge, researchers at the University of Illinois at Urbana-Champaign and Stanford University developed the RecursiveMAS framework, which enables agents to collaborate and transmit information through embedding spaces rather than text. This change yields gains in both efficiency and performance.
Experiments show that RecursiveMAS achieves accuracy improvements in complex domains such as code generation, medical reasoning, and search, while also increasing inference speed and reducing token usage.
RecursiveMAS is also significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.
Challenges of improving multi-agent systems
Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a major challenge is enabling the system to grow, improve, and adapt to different scenarios over time.
Prompt-based optimization improves agent interactions by iteratively refining the shared context provided to agents. By updating these prompts, the system acts as a director, guiding agents to produce responses that are more aligned with the broader goal. The fundamental limitation is that the capabilities of each agent's underlying model remain constant.
A more sophisticated approach is to train the agents by updating the weights of their underlying models. However, training an entire system of agents is difficult because updating all the parameters of multiple models is computationally expensive.
Even if an engineering team is committed to training its models, the standard approach of agents communicating through text creates major hurdles. Because agents rely on sequential text generation, each model has to wait for the previous model to finish producing its text before it can begin its own processing, which adds latency.
Forcing models to explain their intermediate logic token by token so that the next model can read it is highly inefficient. It inflates token usage, increases computation costs, and massively slows down iterative learning throughout the system.
How does RecursiveMAS work?
Rather than trying to improve each agent as an isolated, standalone component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as an integrated whole.
The framework is inspired by the recursive language model (RLM) architecture. In a standard language model, data flows linearly through a stack of distinct layers. In contrast, a recursive language model reuses a shared set of layers, processing data and feeding it back into itself. By looping its computations, the model can deepen its reasoning without adding parameters.
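To make the recursive idea concrete, here is a minimal toy sketch (not the paper's code; layer sizes and class names are assumptions) of how a single shared block can be looped to deepen computation without adding parameters:

```python
import torch
import torch.nn as nn

class TinyRecursiveLM(nn.Module):
    """Toy illustration of the recursive-layer idea: one shared block is applied
    repeatedly, so effective depth grows with recursion steps, not parameters."""
    def __init__(self, hidden_dim: int = 512, num_recursions: int = 4):
        super().__init__()
        # A single shared block reused at every recursion step (hypothetical sizes).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.num_recursions = num_recursions

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Feed the block's output back into itself instead of stacking new layers.
        for _ in range(self.num_recursions):
            hidden_states = self.shared_block(hidden_states)
        return hidden_states
```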
RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this setup, each agent acts like a layer in a recursive language model. Instead of generating text, agents pass their continuous latent representations to the next agent in the sequence, creating a looped hidden stream of information flowing through the system.
This latent processing continues through all the agents. When the last agent completes its processing, its latent outputs are sent directly back to the first agent, starting a new recursion round.
This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in latent space, with only the last agent generating a textual output in the final round. It is as if the agents communicate telepathically as a unified whole, and the final agent delivers the answer in the form of text.
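As an illustration only, the loop described above could be wired roughly as follows; the agent objects, their `forward_latent` and `decode_to_text` methods, and the `external_links` modules are assumed names for this sketch, not the released API:

```python
def run_latent_rounds(agents, external_links, initial_hidden, num_rounds=3):
    """Sketch of the recursion: agents exchange hidden states instead of text,
    and only the last agent in the final round decodes a textual answer."""
    hidden = initial_hidden
    for round_idx in range(num_rounds):
        last_round = (round_idx == num_rounds - 1)
        for i, agent in enumerate(agents):
            # Each agent reasons over the incoming latent state without emitting text.
            hidden = agent.forward_latent(hidden)
            if last_round and i == len(agents) - 1:
                # The final agent in the final round is the only one that produces text.
                return agent.decode_to_text(hidden)
            # An external link projects the output into the next agent's embedding
            # space; the link after the last agent wraps around to the first agent,
            # starting a new recursion round.
            hidden = external_links[i](hidden)
```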
Latent collaboration architecture
To make continuous latent space collaboration possible, the authors introduce a special architectural component called RecursiveLink. It is a lightweight, two-layer module designed to propagate and refine the latent state of a model rather than forcing it to decode text.
The last-layer hidden states of a language model contain a rich, meaningful representation of its reasoning process. RecursiveLink is designed to preserve and propagate this high-dimensional information from one embedding space to another.
To avoid the cost of updating every parameter in multiple large language models, the framework keeps the models' parameters frozen. Instead, it optimizes the system by training only the parameters of the RecursiveLink modules.
To handle both internal reasoning and external communication, the system uses two types of RecursiveLink modules. The internal recursive link works inside an agent during its reasoning phase: it takes the model's newly produced hidden states and maps them directly back into its input embedding space, allowing the agent to continue a stream of latent thoughts without generating separate text tokens.
The external recursive link acts as a bridge between agents. Because agents in real-world systems may use different model architectures and sizes, their internal embedding spaces can have completely different dimensions, so the external recursive link includes an additional layer that maps embeddings from one agent's hidden dimension into the embedding space of the next agent.
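The following is a minimal sketch of what such link modules could look like in PyTorch; the layer sizes, activation, and class names are assumptions for illustration, not the authors' released implementation:

```python
import torch.nn as nn

class InternalLinkSketch(nn.Module):
    """Lightweight two-layer adapter that maps an agent's last-layer hidden states
    back into its own input embedding space (hypothetical sizes)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden_states):
        return self.fc2(self.act(self.fc1(hidden_states)))

class ExternalLinkSketch(InternalLinkSketch):
    """Same two-layer core plus an extra projection so an agent with hidden size
    `src_dim` can feed the embedding space of an agent with size `dst_dim`."""
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__(src_dim)
        self.project = nn.Linear(src_dim, dst_dim)

    def forward(self, hidden_states):
        return self.project(super().forward(hidden_states))
```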
During training, the internal links are first trained independently to warm up each agent's ability to reason in continuous latent embeddings. The system then enters outer-loop training, in which the diverse, frozen models are linked together in a loop and the system is optimized based on the final text output of the last agent.
Only the RecursiveLink parameters are updated during training; the original model weights remain frozen, much like in low-rank adaptation (LoRA). Another advantage of this design is its efficiency when multiple agents run on top of the same backbone model.
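Under the same assumptions as the sketches above, the optimization setup amounts to something like the following: freeze every backbone and hand only the link parameters to the optimizer (the learning rate and optimizer choice here are placeholders, not the paper's settings):

```python
import itertools
import torch

def build_optimizer(agents, links, lr=1e-4):
    """Freeze the backbone LLMs and optimize only the RecursiveLink-style modules.
    `agents` and `links` are assumed to be lists of nn.Module instances."""
    for agent in agents:
        for param in agent.parameters():
            param.requires_grad_(False)  # backbones stay frozen throughout training
    trainable = list(itertools.chain.from_iterable(link.parameters() for link in links))
    # Gradients reach these parameters by backpropagating the loss on the final
    # agent's text output through the latent loop.
    return torch.optim.AdamW(trainable, lr=lr)
```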
If two agents in a multi-agent system are built on exactly the same base model but act in different roles, you do not need to load two copies of the model into GPU memory or train them separately. The agents share the same backbone model and use their recursive links as connective tissue.
RecursiveMAS in action
Researchers evaluated RecursiveMAS across nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They built multi-agent systems using open-weight models including Qwen, Llama-3, Gemma 3, and Mistral, assigning them roles to create different collaboration patterns such as sequential reasoning and mixture-of-experts-style setups.
RecursiveMAS was compared to baselines under similar training budgets, including strong standalone models trained with LoRA or full supervised fine-tuning, alternative multi-agent frameworks such as Mixture-of-Agents and TextGrad, and recursive baselines such as LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces agents to communicate explicitly through text.
RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the strongest baseline across all benchmarks. It particularly excelled on logic-heavy tasks, outperforming text-based optimization methods such as TextGrad by 18.1% on AIME2025 and 13% on AIME2026.
Because it avoids generating text at every step, RecursiveMAS achieved a 1.2x to 2.4x end-to-end inference speedup. RecursiveMAS is also far more token-efficient than the alternatives. Compared with the text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first round of recursion, and by the third round it achieves a 75.6% token reduction. RecursiveMAS also proved significantly cheaper to train: because it only updates the lightweight RecursiveLink module, which contains about 13 million parameters, roughly 0.31% of the frozen models' parameters, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.
Enterprise adoption
The efficiency gains – lower token consumption, lower GPU memory requirements, and faster inference – are intended to make complex multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agentic deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.