CommunityNews

CommunityNews

LLMs Corrupt Your Documents When You Delegate

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

Read in full here:

Where Next?

Popular Ai topics Top

First poster: CommunityNews
AI models are increasingly applied in high-stakes domains like health and conservation. Data quality carries an elevated signifi- cance i...
New
First poster: CommunityNews
The use of facial recognition for surveillance, or algorithms that manipulate human behaviour, will be banned under proposed EU regulatio...
New
First poster: CommunityNews
Many recent big advances in tech have one key thing at the heart of then: artificial intelligence.
New
First poster: bot
A research group has taught AI to magnetically wrangle a high-powered stream of plasma used for fusion research — but wait! Put away your...
New
First poster: bot
Technique could allow high-quality calls and music on low-quality connections.
New
First poster: mercyf
It replicates your development environment and takes up to 30 minutes per task.
New
CommunityNews
I run Claude Code with --dangerously-skip-permissions flag, giving it full system access. Let me show you a new way of approaching comput...
New
New
CommunityNews
Contracted AI raters describe grueling deadlines, poor pay and opacity around work to make chatbots intelligent
New
CommunityNews
1 skill, 17 commands, and curated anti-patterns for impeccable frontend design. Works with Cursor, Claude Code, Gemini CLI, and Codex CLI...
New

Other popular topics Top

Exadra37
Please tell us what is your preferred monitor setup for programming(not gaming) and why you have chosen it. Does your monitor have eye p...
New
PragmaticBookshelf
From finance to artificial intelligence, genetic algorithms are a powerful tool with a wide array of applications. But you don't need an ...
New
AstonJ
We have a thread about the keyboards we have, but what about nice keyboards we come across that we want? If you have seen any that look n...
New
AstonJ
This looks like a stunning keycap set :orange_heart: A LEGENDARY KEYBOARD LIVES ON When you bought an Apple Macintosh computer in the e...
New
PragmaticBookshelf
Learn different ways of writing concurrent code in Elixir and increase your application's performance, without sacrificing scalability or...
New
rustkas
Intensively researching Erlang books and additional resources on it, I have found that the topic of using Regular Expressions is either c...
New
AstonJ
We’ve talked about his book briefly here but it is quickly becoming obsolete - so he’s decided to create a series of 7 podcasts, the firs...
New
CommunityNews
A Brief Review of the Minisforum V3 AMD Tablet. Update: I have created an awesome-minisforum-v3 GitHub repository to list information fo...
New
PragmaticBookshelf
A concise guide to MySQL 9 database administration, covering fundamental concepts, techniques, and best practices. Neil Smyth MySQL...
New
PragmaticBookshelf
Use advanced functional programming principles, practical Domain-Driven Design techniques, and production-ready Elixir code to build scal...
New