LLMs Corrupt Your Documents When You Delegate

April 17, 2026 | arXiv:2604.15597

Philippe Laban, Tobias Schnabel, Jennifer Neville

cs.CL, cs.HC

TLDR

LLMs silently corrupt documents during long delegated workflows by introducing sparse but severe errors; even frontier models corrupt an average of 25% of document content by the end of a workflow.

Key contributions

Introduced DELEGATE-52, a benchmark for studying how ready LLMs are for delegated document editing across 52 professional domains.
Found, in an experiment with 19 LLMs, that even frontier models corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely.
Degradation worsens with document size, interaction length, and presence of distractor files.
Agentic tool use did not improve performance on DELEGATE-52.

Why it matters

This paper reveals a significant challenge for LLMs in delegated knowledge work: they tend to silently corrupt the documents they edit, and the errors compound over long interactions. It underscores the need for improved reliability before LLMs can be trusted with complex editing tasks.

Original Abstract

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

