Stanford's DeLM cuts multi-agent task costs 50% — without a central orchestrator

One assumption behind today’s multi-agent AI frameworks is that agents need a “boss” at the center: an orchestrator that runs the show, routes requests, and makes sure the entire system doesn’t descend into chaos.

This assumption may be wrong, and the cost of carrying it can be measured in dollars and coordination latency. A new Stanford framework called the Decentralized Language Model, or DeLM, is built on the premise that agents can coordinate directly, without routing every update through a central controller.

DeLM’s shared knowledge base serves as a “common communication substrate” so that agents can “merge, filter, and re-broadcast” each other’s verified progress without having to route it through a main agent, the framework’s co-developers Yuzhen Mao and Azalia Mirhoseini explain in a research paper.

Such a system is not only possible but, in some cases, desirable. “Agents can build on prior findings, avoid repeated failures, preserve constraints, and recover detailed evidence only when needed,” the researchers write.

Challenges of traditional multi-agent systems

In a typical centralized multi-agent system, a main agent breaks tasks into sub-tasks, delegates them to multiple sub-agents in parallel, waits for responses, merges and summarizes intermediate progress, then launches the next wave of commands based on the gathered context.

Although this is a natural way to scale LLM reasoning, the Stanford researchers argue that it scales poorly. Each useful discovery, partial finding, and failure must be reported to the main agent, which then decides what information to merge and rebroadcast to the sub-agents below it.

Mao and Mirhoseini write, “As the number of subtasks increases, this becomes a bottleneck in controller communication and integration.” Furthermore, the central orchestrator may “undermine, remove, or distort” useful information, causing progress to be lost.

This bottleneck also occurs in long-context reasoning scenarios. Once it receives reports from sub-agents, a main agent will typically group related concepts, data points, and other content together in an unsupervised learning loop. It may then pre-assign these “evidence groups” before sub-agents know whether the content presented is actually relevant or whether it has been grouped correctly.

When a sub-agent receives this insufficient context, it will inevitably get confused and return to the main agent, triggering another retrieval or delegation round. “This makes back-and-forth coordination slower, more iterative, and increasingly disrupted by an overloaded lead agent,” the researchers write.

What does DeLM address, and how does it work?

DeLM, in contrast, is built around parallel agents, a shared context, and a task queue.

The shared context is essentially a curated store of “gists,” or information summaries, that other agents may find useful. These include partial findings and documented failures as well as verified, evidence-based findings. The gists also point to the wider body of evidence that agents can retrieve when their specific tasks call for it.

The task queue is a set of pending subtasks that agents can freely claim.

“Agents write concise, verified updates to a shared context that subsequent agents can read directly,” the researchers write. Useful findings, failures, and obstacles are stored as “shared problem states” rather than passing through a central controller.

The pipeline looks like this:

  • Initialization: The input is divided into separate work units, which are added to a queue.

  • Parallel Execution: Agents work both independently and collaboratively, claiming tasks and reading the shared context as they progress.

  • Compression and Validation: The results are compressed into reusable gists that are checked against supporting evidence. Only fully verified gists are shared with the group.

  • Additional work (if necessary): When the queue is empty, the last agent to respond inspects the entire shared context to determine whether further work is needed.

  • Final Step: The final agent determines that no further steps are needed and returns the final answer.

“Agents exchange progress via shared state, claim ready tasks asynchronously, and scale more adaptively as the number of subtasks increases,” the researchers explain.
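The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the names `Gist`, `SharedContext`, `worker`, `run`, `solve`, and `follow_up` are hypothetical, the validation rule (a summary must appear verbatim in its evidence) is a stand-in for real evidence checking, and the follow-up check runs after all workers finish rather than inside the last agent to respond.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Gist:
    """A compressed update that other agents can read."""
    task: str
    summary: str
    evidence: str


@dataclass
class SharedContext:
    """Shared store of verified gists; a lock guards concurrent access."""
    gists: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def publish(self, gist: Gist) -> bool:
        # Hypothetical validation rule: the summary must be backed by its evidence.
        if gist.summary not in gist.evidence:
            return False
        with self._lock:
            self.gists.append(gist)
        return True

    def snapshot(self) -> list:
        with self._lock:
            return list(self.gists)


def worker(tasks: queue.Queue, context: SharedContext, solve) -> None:
    """Claim ready tasks until the queue is empty; no controller assigns them."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        known = context.snapshot()            # read what others have verified
        summary, evidence = solve(task, known)
        context.publish(Gist(task, summary, evidence))


def run(work_units, solve, follow_up=lambda gists: [], n_agents=3):
    """Initialize the queue, run agents in parallel, then check for more work."""
    context = SharedContext()
    pending = list(work_units)
    while pending:
        tasks = queue.Queue()
        for unit in pending:
            tasks.put(unit)
        agents = [
            threading.Thread(target=worker, args=(tasks, context, solve))
            for _ in range(n_agents)
        ]
        for agent in agents:
            agent.start()
        for agent in agents:
            agent.join()
        # Inspect the whole shared context to decide whether more work is needed.
        pending = follow_up(context.snapshot())
    return context.snapshot()
```

Here `solve(task, known)` would wrap a model call and return a `(summary, evidence)` pair, while `follow_up(gists)` returns any new work units, or an empty list once the answer is complete.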

How does DeLM perform in the wild?

With DeLM, agents can avoid unnecessary exploration, reuse and build on each other’s discoveries and failures, and focus on unresolved issues.

The framework can be particularly useful for test-time scaling in software engineering, where models are given more time to “think” to improve their reasoning and problem-solving capabilities. Different agents can explore their own hypotheses or pursue reasoning paths in parallel while sharing intermediate progress. One example is concurrent debugging.

DeLM is also suitable for long-context reasoning and multi-document question answering. Agents can examine their own evidence sets (collections of papers, code, or other materials) at the same time, while maintaining a “global compact view” of the accumulated evidence.

The researchers argue that this makes agentic tasks more accurate and significantly cheaper. This is supported by DeLM’s performance on real-world benchmarks: on SWE-Bench Verified, which evaluates how well AI models and agents solve real-world software engineering problems, it outperformed the strongest baseline by 10.5% and reduced the cost per task by nearly 50%.

But the gains may go beyond coding: on LongBench‑v2 Multi‑Doc QA, which assesses the ability of LLMs to handle long-context, real-world problems, DeLM achieved the highest accuracy across four model families: GPT‑5.4, Claude Sonnet, Gemini Flash, and DeepSeek‑v4‑Pro.

As Mao explained in detail on X, DeLM outperforms other approaches on SWE-Bench for several reasons.

First, agents share failures. In an ordinary parallel run, when an agent goes down the wrong path, that failure remains private, and subsequent agents can waste time (and money) chasing the same dead end. With DeLM, failed hypotheses are written to the shared context.

“Later agents can read them as obstacles, avoid repeated exploration, and redirect their search toward more promising improvements,” Mao said.
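A small simulation makes the saving concrete. It is a toy sketch with made-up hypotheses, not a reproduction of DeLM: the function `explore` and its parameters are hypothetical, agents run one after another for determinism, and each attempt stands in for a costly model or tool call.

```python
def explore(agent_candidates, is_fix, share_failures):
    """Count attempts until some agent finds a working hypothesis.

    agent_candidates: one list of hypotheses per agent, tried in order.
    share_failures:   if True, dead ends are visible to later agents.
    """
    shared = set()
    attempts = 0
    for candidates in agent_candidates:
        # With sharing, every agent reads the same set of known dead ends;
        # without it, each agent starts from a private, empty set.
        seen = shared if share_failures else set()
        for hypothesis in candidates:
            if hypothesis in seen:
                continue              # treat a recorded failure as an obstacle
            attempts += 1
            if is_fix(hypothesis):
                return hypothesis, attempts
            seen.add(hypothesis)
    return None, attempts


agents = [
    ["missing null check", "stale cache key"],
    ["missing null check", "stale cache key", "race in retry loop"],
]
is_fix = lambda h: h == "race in retry loop"

private_run = explore(agents, is_fix, share_failures=False)
shared_run = explore(agents, is_fix, share_failures=True)
```

In this example the private run spends five attempts before reaching the fix, while the shared run spends three, because the second agent skips the two hypotheses the first agent already ruled out.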

Additionally, once constraints are verified, they are immediately added to the agents’ shared context, where they become binding shared state. Mao said, “Later agents inherit them, build around them and avoid repeating globally invalid simplifications.”

Importantly, DeLM keeps shared progress concise enough to be reused. It is unfoldable, meaning agents see brief summaries by default but can choose to unfold them into more detailed summaries and raw evidence.

As the researchers note, providing all raw documents and traces gives agents the maximum amount of information, but can overload their context windows and ultimately increase costs.

Mao said, “If agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate logic, turning coordination itself into another long-context bottleneck.”

On the other hand, while sharing only brief summaries is cheaper, important details and evidence may be lost, resulting in less well-grounded reasoning.

Unfolding therefore provides “coarse-to-fine,” opt-in access to detail, which can improve both accuracy and cost.
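One way to picture coarse-to-fine access is an entry that stores three levels of detail and hands out the shortest one unless asked otherwise. The sketch below is illustrative only: `UnfoldableEntry`, `view`, and `read_context` are hypothetical names, and the three fixed levels are an assumption based on the description above (brief summary, detailed summary, raw evidence).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnfoldableEntry:
    gist: str       # level 0: brief summary, shown by default
    detail: str     # level 1: more detailed summary
    evidence: str   # level 2: raw evidence, such as a log or trace

    def view(self, level: int = 0) -> str:
        """Return the text at the requested level, clamped to 0..2."""
        levels = (self.gist, self.detail, self.evidence)
        return levels[max(0, min(level, 2))]


def read_context(entries, unfold=None):
    """Build an agent's view of the shared context.

    unfold maps an entry index to the level that entry should be opened to;
    every other entry stays at its brief gist.
    """
    unfold = unfold or {}
    return "\n".join(
        entry.view(unfold.get(i, 0)) for i, entry in enumerate(entries)
    )


entries = [
    UnfoldableEntry(
        gist="Patch A fails test_login",
        detail="Patch A fails test_login because the session token is None",
        evidence="Traceback: AttributeError in auth.py ... (full log)",
    ),
    UnfoldableEntry(
        gist="Config loader ignores env overrides",
        detail="Config loader reads the file after env vars, so the file wins",
        evidence="Diff and command history for the config loader ... (full log)",
    ),
]

compact = read_context(entries)               # gists only
targeted = read_context(entries, {0: 2})      # open raw evidence for entry 0
```

An agent working on the login failure would pay the extra tokens only for the first entry's evidence, while still seeing the one-line gist of everything else.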

Ultimately, with a framework like DeLM, agents can be more efficient, because they avoid reading the same document repeatedly or re-running the same failed analysis; more effective, because useful findings propagate across parallel threads; and more robust, because they share only verified claims.

For enterprise builders, DeLM challenges a core assumption: that every multi-agent workflow needs a central controller. SWE-Bench and LongBench-v2 results show that the decentralized model is not only theoretically cleaner – it is faster, more accurate, and costs about half as much.


