Artificial intelligence is rapidly becoming an integral part of modern workplaces, functioning not just as a tool but increasingly as a collaborative “co-worker.” From drafting emails to editing complex documents and writing code, AI-powered systems are transforming productivity.
However, a recent study by Microsoft Research highlights a critical downside: excessive reliance on these systems may undermine the very quality of work they aim to enhance.
The concept of AI delegation is simple yet powerful. Instead of manually executing tasks, users provide instructions and allow AI systems to perform them. This approach, often referred to as “delegated work” or “vibe coding,” is gaining traction across industries due to its efficiency and scalability.
However, this efficiency depends heavily on trust. As researchers noted,
“Delegation requires trust – the expectation that the LLM will faithfully execute the task without introducing errors into documents.”
This assumption, the study suggests, may not yet hold true in real-world scenarios.
The research reveals a concerning trend: large language models (LLMs) tend to degrade document quality when tasked with repeated edits. According to the findings, in some cases even the most advanced models “corrupt an average of 25% of document content” after extended use.
Using a benchmark called DELEGATE-52, researchers evaluated 19 AI models across 52 professional domains, including coding, accounting, music notation, and textile design.
The results were striking:
“Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%.”
One of the most alarming aspects of the study is how AI errors manifest. Rather than obvious failures, systems tend to introduce subtle inaccuracies:
“These could be simple mistakes like a wrong number or a missing sentence. But when the document is edited repeatedly, the errors pile up and change the final output.”
Researchers described these issues as:
“sparse but severe errors that silently corrupt documents.”
Over time, these minor inconsistencies accumulate, leading to significant distortions in content quality.
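The compounding described above can be made concrete with some simple arithmetic. The sketch below is illustrative only: it assumes each delegated edit independently corrupts a small, fixed fraction of the document, and the per-edit rate chosen is a back-of-the-envelope assumption consistent with the study's reported 25% loss over 20 interactions, not a figure from the study itself.

```python
# Hedged sketch: if each delegated edit silently corrupts an
# independent fraction p of a document, the intact fraction after
# n edits compounds geometrically as (1 - p) ** n.
def intact_fraction(per_edit_error_rate: float, num_edits: int) -> float:
    """Fraction of original content still intact after repeated edits."""
    return (1 - per_edit_error_rate) ** num_edits

# An assumed per-edit corruption rate of ~1.4% compounds to roughly
# the 25% loss over 20 interactions reported for frontier models.
loss_after_20 = 1 - intact_fraction(0.0142, 20)
print(f"{loss_after_20:.0%}")  # about 25%
```

The point of the arithmetic is that a per-edit error rate small enough to look harmless in a single interaction can still erase a quarter of a document over a realistic editing session.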
The study highlights a fundamental limitation of current AI systems: they struggle with long, multi-step workflows. While performance may appear strong in short tasks, it deteriorates as interactions increase.
“Short-term performance… is not always predictive of long-horizon performance,” the researchers found.
This insight is particularly important because most professional tasks—such as report writing, financial documentation, and legal drafting—require multiple rounds of edits.
Interestingly, giving AI models access to additional tools—such as file editors or code execution environments—did not improve outcomes. In fact, performance slightly worsened.
The reason lies in complexity. Tool usage increases the volume of data the model must process, making it harder to maintain consistency across iterations.
The study also found that AI reliability varies significantly across domains:
Coding emerged as the only domain where most models could consistently handle delegated workflows effectively.
As organizations increasingly integrate AI into daily operations, the study raises important questions about oversight and accountability.
From drafting reports to managing internal documentation, many companies rely on AI with minimal human review. However, researchers caution against this approach:
“Current LLMs are unreliable delegates.”
They further warned:
“Users still need to closely monitor LLM systems as they operate.”
This is especially critical for high-stakes environments such as finance, healthcare, and legal services, where even minor errors can have significant consequences.
Despite the challenges, the study acknowledges that AI technology is evolving rapidly. Newer models show measurable improvements compared to earlier versions, even if they are not yet reliable enough for full delegation.
Organizations are now exploring hybrid workflows, where AI assists but humans remain in control—ensuring both efficiency and accuracy.
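One lightweight form such a hybrid workflow can take is surfacing every AI change as a diff for human review before it is accepted. The sketch below is a minimal illustration of that idea using Python's standard difflib module; the example strings and the framing as a review step are assumptions, not a workflow described in the study.

```python
import difflib

def review_edit(original: str, edited: str) -> list[str]:
    """Return a unified diff so a human can inspect every AI change."""
    return list(difflib.unified_diff(
        original.splitlines(), edited.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

# A subtle "wrong number" edit, of the kind the study warns about,
# becomes immediately visible in the diff.
before = "Revenue rose 4% in Q3.\nHeadcount was flat."
after = "Revenue rose 14% in Q3.\nHeadcount was flat."
for line in review_edit(before, after):
    print(line)
```

Because the errors are sparse and silent, a diff-style checkpoint after each delegated edit catches exactly the class of mistake that full automation would let accumulate.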
The findings from Microsoft Research serve as a timely reminder that while AI offers immense productivity gains, it is not infallible. The growing trend of delegating complex tasks to AI systems must be balanced with human oversight.
As AI continues to evolve, the challenge for businesses will be to strike the right balance—leveraging automation without compromising quality. In the current landscape, the message is clear: trust AI, but verify every step.