
As large language models become more capable, users are increasingly tempted to delegate knowledge tasks in which models process documents on their behalf and deliver finished results. But how much can you trust a model to remain faithful to the content of your documents when it must iterate on them over multiple rounds?
A new study by Microsoft researchers shows that large language models silently corrupt the documents they operate on. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, paired with a method that automatically measures how much content degrades over time.
Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows, and that providing models with agentic tools or realistic distractor documents actually makes their performance worse.
This serves as a reminder that, despite increasing pressure to automate knowledge tasks, current language models are not yet completely reliable for this kind of work.
Mechanics of Delegated Work
Microsoft’s study focuses on “delegated work”, an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents.
A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated work extends far beyond programming into other domains. In accounting, for example, a user can supply a dense bookkeeping file and instruct the model to split the document into separate files organized by expense category.
Because users may lack the time or specialized expertise to manually review every modification the AI applies, delegation often relies on trust: users expect the model to complete tasks faithfully, without introducing uncontrolled errors, unauthorized deletions, or hallucinations into their documents.
To measure how much AI systems can be trusted in extended, iterative delegated workflows, researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation.
Each work environment is built around a real-world seed text document of 2,000 to 5,000 tokens, along with five to ten complex, non-trivial editing tasks.
Grading a complex, multi-step editing process typically requires expensive human review. DELEGATE-52 circumvents this with a "round-trip relay" simulation method that evaluates answers without the need for human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, in which a model translates a document into another language and back, and the result is compared against the original to see how faithfully it is reproduced.
Accordingly, every editing function in DELEGATE-52 is designed to be completely reversible, pairing each forward instruction with its exact inverse. For example, an instruction to split a ledger into separate files by expense category is paired with an instruction to merge all the category files back into a single ledger.
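To make the idea concrete, here is a minimal sketch of such a forward/inverse pair in Python. The ledger format, function names, and the order-insensitive equality check are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of a reversible edit pair in the spirit of DELEGATE-52.
# The ledger format and function names are illustrative assumptions.

def split_by_category(ledger: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Forward edit: split one ledger into per-category files."""
    files: dict[str, list[str]] = {}
    for category, entry in ledger:
        files.setdefault(category, []).append(entry)
    return files

def merge_categories(files: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Inverse edit: merge the category files back into a single ledger."""
    return [(category, entry)
            for category, entries in sorted(files.items())
            for entry in entries]

# Round-trip check: equality is done on a normalized (order-insensitive)
# representation, since splitting and re-merging may reorder entries.
original = [("travel", "Taxi $40"), ("meals", "Lunch $12"), ("travel", "Hotel $200")]
assert sorted(merge_categories(split_by_category(original))) == sorted(original)
```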
In comments provided to VentureBeat, Philippe Laban, senior researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether AI can hit "undo." Because human workers cannot be forced to immediately forget a task they have just completed, this round-trip evaluation is uniquely suited to AI: by starting a new conversation session, the researchers force the model to attempt the reversal task completely independently.
The models in their experiments “do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. "They are trying to do every task as perfectly as possible at every stage."
These round-trip tasks are chained together in a continuous relay to simulate a long-horizon workflow spanning up to 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distraction files into the context of each task: 8,000 to 12,000 tokens of documents that are related to the topic but completely irrelevant to the task at hand. The distractors measure whether the AI can maintain focus or whether it gets confused and pulls in the wrong data.
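A minimal sketch of that relay loop, under stated assumptions: `run_model_session` stands in for a fresh, stateless LLM call and is left unimplemented, and `content_overlap` is a crude line-level survival score. Neither is the paper's actual code.

```python
def run_model_session(instruction: str, doc: str, distractors: list[str]) -> str:
    """Stub for a fresh, stateless LLM call (a real system would call a model API)."""
    raise NotImplementedError

def content_overlap(reference: str, candidate: str) -> float:
    """Crude proxy metric: fraction of the reference's lines that survive."""
    ref_lines = set(reference.splitlines())
    kept = sum(1 for line in ref_lines if line in candidate)
    return kept / max(len(ref_lines), 1)

def relay(seed_doc: str, task_pairs: list[tuple[str, str]],
          distractors: list[str], max_steps: int = 20) -> list[float]:
    """Chain forward/inverse instruction pairs into a 20-step relay.

    Each step opens a new session, so the model cannot rely on memory of
    the instruction it is now being asked to reverse.
    """
    doc = seed_doc
    retention: list[float] = []  # content survival after each interaction
    for i in range(max_steps // 2):
        forward, inverse = task_pairs[i % len(task_pairs)]
        for instruction in (forward, inverse):
            # 8,000-12,000 tokens of on-topic but irrelevant files are
            # mixed into the workspace as distractions at every step.
            doc = run_model_session(instruction, doc, distractors)
            retention.append(content_overlap(seed_doc, doc))
    return retention
```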
Testing Frontier Models in the Relay
To understand how different architectures and scales handle the assigned tasks, the researchers tested 19 language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions.
Across all models, documents suffered an average of 50% degradation by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, corrupted an average of 25% of document content.
Of the 52 professional domains, Python was the only one in which the majority of models achieved "ready" status with a score of 98% or higher. Models excel at programmatic tasks but struggle severely in natural-language and niche domains such as fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 of the 52 domains.
Interestingly, the corruption was not a death by a thousand cuts, with models gradually accumulating small errors. Instead, about 80% of the total degradation came from rare but catastrophic failures: single interactions in which a model suddenly drops at least 10% of the document's content. Stronger models do not necessarily avoid small errors better; they simply postpone these catastrophic failures until later.
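Given a per-step retention series like the one the relay sketch above produces, separating the two failure modes is straightforward. This is an illustrative analysis, not the paper's method; the 10% threshold mirrors the figure reported in the study.

```python
# Illustrative only: flag single interactions that drop at least 10% of
# the document's content, separating rare catastrophic failures from
# gradual "thousand cuts" erosion.

def catastrophic_steps(retention: list[float], threshold: float = 0.10) -> list[int]:
    drops, prev = [], 1.0  # the document starts fully intact
    for step, score in enumerate(retention):
        if prev - score >= threshold:
            drops.append(step)
        prev = score
    return drops

# Example: steady small losses vs. one sudden collapse at step 3.
print(catastrophic_steps([0.99, 0.98, 0.97, 0.72, 0.71]))  # -> [3]
```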
Another important observation is that when weaker models fail, their degradation comes primarily from deleting material. When frontier models fail, they actively corrupt existing material: the text is still there, but it has been subtly distorted or rewritten, making the error very difficult for a human reviewer to detect.
Interestingly, arming the models with standard agentic tools for code execution and file read/write access actually made their performance worse, producing an average of 6% more degradation. Laban attributed the failure to reliance on generic tools rather than domain-specific ones.
"The models lack the ability to write immediately effective programs that can manipulate files in different domains without making mistakes," He noted. "When they can’t do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution is for developers to build tightly scoped tools (such as specific tasks to count or move entries within .ledger files) to keep agents on track.
Degradation also increases as documents grow larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of dirty context. While a noisy context window may cause a minimal 1% performance degradation after only two interactions, that loss compounds into a 2–8% degradation over the course of longer simulations.
"For the recovery community: RAG pipelines should be evaluated on multi-step workflows, not just single-turn recovery benchmarks," Laban said. "Single-turn measurements systematically underestimate the loss of precision recovery."
A Reality Check for the Autonomous Enterprise
The findings of the DELEGATE-52 benchmark provide an important reality check on the current hype around fully autonomous AI agents.
The benchmark's design also points to a practical constraint: because models can maintain a clean record for many steps before a sudden catastrophic failure, incremental human review is necessary rather than a single final check. Laban recommends building AI applications around small, transparent tasks rather than complex long-horizon agents.
For organizations looking to safely deploy autonomous agents today, the DELEGATE-52 methodology provides a practical blueprint for testing data pipelines in-house. "An enterprise team wishing to adopt this framework needs to build three components: (a) a set of reversible editing functions representative of their workflow, (b) a parser that converts their domain documents into a structured representation, and (c) an equality function that compares the two parsed representations," Laban explained. Teams don't even need to build the parser from scratch: the Microsoft research team successfully reused existing parsing libraries for 30 of the 52 domains tested.
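As a sketch, those three components might be expressed as the following minimal interface; all names and types here are illustrative assumptions, not part of any released DELEGATE-52 code.

```python
# Illustrative interface for the three components Laban describes.
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ReversibleTask:
    forward_instruction: str   # (a) an edit representative of the workflow...
    inverse_instruction: str   #     ...paired with its exact inverse

class DomainHarness(Protocol):
    def parse(self, doc: str) -> Any: ...          # (b) domain-document parser
    def equal(self, a: Any, b: Any) -> bool: ...   # (c) equality on parsed forms

def round_trip_passed(harness: DomainHarness, original: str, final: str) -> bool:
    """After the forward and inverse edits run in fresh sessions, the
    round trip passes if the two parsed representations match."""
    return harness.equal(harness.parse(original), harness.parse(final))
```

With those pieces filled in for its own document formats, a team can reuse a relay-style loop like the one sketched earlier to stress-test its pipeline end to end.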
Laban is optimistic about the rate of improvement. "The progress is real and fast. Looking at the GPT family alone, the models go from scores below 20% to nearly 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52."
However, Laban cautioned that DELEGATE-52 is purposefully small compared to large-scale enterprise environments. Even if foundation models essentially master this benchmark, the endlessly long tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.