Researchers from PSU and Duke introduce "Multi-Agent Systems Automated Failure Attribution"

Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University
The co-first authors are Shaokun Zhang of Penn State University and Ming Yin of Duke University.

In recent years, LLM Multi-Agent systems have garnered widespread attention for their collaborative approach to solving complex problems. Yet when such a system fails at a task, developers are left with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to find the answer is a familiar frustration for developers.
In increasingly complex Multi-Agent systems, failures are not only common but also incredibly difficult to diagnose, owing to the autonomous nature of agent collaboration and the long information chains involved. To tackle this problem, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the new research task of automated failure attribution, together with the Who&When benchmark dataset, and have developed and evaluated several automated attribution methods.
This work not only highlights the complexity of the task but also paves a new path toward enhancing the reliability of LLM Multi-Agent systems. The paper has been accepted as a Spotlight presentation at the top-tier machine learning conference ICML 2025, and the code and dataset are now fully open-source.

Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/
LLM Multi-Agent systems have already demonstrated their collaborative strengths across many domains. However, these systems are fragile; errors by a single agent, misunderstandings between agents, or mistakes in information transmission can lead to the failure of the entire task. Currently, when a system fails, developers are often left with manual and inefficient methods for debugging:

Manual Log Archaeology: Developers must manually review lengthy interaction logs to find the source of the problem.

Reliance on Experience: Diagnosis depends heavily on the developer's intuition and expertise, which limits its scalability and reliability.

There is an urgent need for an automated, systematic method to pinpoint the cause of failures, effectively bridging the gap between observing that a failure has happened and knowing how to fix it.
To address the challenges above, the work makes three contributions.

1. A new task: automated failure attribution. The task is defined as identifying the agent responsible for a failure (who) and the decisive error step at which it occurred (when) from a Multi-Agent system's failure log.

2. The Who&When benchmark dataset. This dataset includes a wide range of failure logs collected from 127 LLM Multi-Agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:

Who: The agent responsible for the failure.
When: The specific interaction step where the decisive error occurred.
Why: A natural language explanation of the cause of the failure.

(A hypothetical example record in this spirit is sketched in code after this list.)
3. Initial methods for automated failure attribution. The paper designs and evaluates three strategies (sketched in code after this list):

All-at-Once: This method provides the LLM with the user query and the complete failure log, asking it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it may struggle to pinpoint precise errors in long contexts.

Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step but incurs higher costs and risks accumulating errors.

Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which segment contains the error, and narrows the range until the decisive step is located.
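To make the Who/When/Why annotations above concrete, here is a hypothetical record in that spirit. The field names and the scenario are illustrative guesses, not the actual schema of the Who&When dataset; see the Hugging Face release for the real format.

```python
# A hypothetical annotated failure record illustrating the Who/When/Why labels.
# Field names and content are illustrative only, not the Who&When schema.
failure_record = {
    "query": "Book the cheapest direct flight from Boston to Denver next Friday.",
    "history": [
        {"step": 0, "agent": "Planner",      "content": "Ask SearchAgent for direct BOS-DEN flights on Friday."},
        {"step": 1, "agent": "SearchAgent",  "content": "Cheapest option found: $182, one stop in Chicago."},
        {"step": 2, "agent": "BookingAgent", "content": "Booked the $182 itinerary."},
    ],
    # Fine-grained human annotations:
    "who": "SearchAgent",   # the agent responsible for the failure
    "when": 1,              # the interaction step where the decisive error occurred
    "why": "SearchAgent dropped the 'direct flight' constraint, and later agents propagated the error.",
}
```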
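And to make the three strategies themselves concrete, below is a minimal Python sketch of how they could be wired up around a generic LLM call. It is an illustration under assumptions only: the call_llm helper, the prompt wording, the (agent, message) log format, and the naive answer parsing are all invented here and may differ from the open-source implementation.

```python
from typing import Callable, List, Optional, Tuple

# Assumed log format: a list of (agent_name, message) steps.
# call_llm is any text-in, text-out LLM client supplied by the caller.
Log = List[Tuple[str, str]]
LLM = Callable[[str], str]

def format_steps(log: Log, offset: int = 0) -> str:
    """Render steps with global indices so the LLM can refer to them."""
    return "\n".join(f"[{offset + i}] {agent}: {msg}" for i, (agent, msg) in enumerate(log))

def all_at_once(call_llm: LLM, query: str, log: Log) -> str:
    """One pass over the whole log: cheap, but may miss details in long contexts."""
    prompt = (
        f"User query: {query}\n\nFailure log:\n{format_steps(log)}\n\n"
        "Which agent made the decisive error, and at which step? "
        "Explain your reasoning, then answer on the last line as 'agent, step'."
    )
    return call_llm(prompt)

def step_by_step(call_llm: LLM, query: str, log: Log) -> Optional[Tuple[str, int]]:
    """Review the log sequentially, judging each step until the error is found."""
    for i, (agent, _msg) in enumerate(log):
        prompt = (
            f"User query: {query}\n\nInteraction so far:\n{format_steps(log[: i + 1])}\n\n"
            f"Is step [{i}] by {agent} the decisive error that dooms the task? "
            "Answer 'yes' or 'no' and explain briefly."
        )
        if call_llm(prompt).strip().lower().startswith("yes"):
            return agent, i
    return None  # no step was flagged

def binary_search(call_llm: LLM, query: str, log: Log) -> Tuple[str, int]:
    """Repeatedly halve the log, asking which half contains the decisive error."""
    lo, hi = 0, len(log)  # the error is assumed to lie in log[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        prompt = (
            f"User query: {query}\n\n"
            f"Segment A (steps {lo}-{mid - 1}):\n{format_steps(log[lo:mid], lo)}\n\n"
            f"Segment B (steps {mid}-{hi - 1}):\n{format_steps(log[mid:hi], mid)}\n\n"
            "Which segment contains the decisive error? Answer 'A' or 'B'."
        )
        if call_llm(prompt).strip().upper().startswith("A"):
            hi = mid
        else:
            lo = mid
    return log[lo][0], lo  # (responsible agent, decisive error step)
```

A caller would plug in its own call_llm wrapper (for GPT-4o or any other model) and would likely parse the answers more robustly than the simple prefix checks used here.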
Findings

Experiments were conducted in two settings: one where the LLM knows the ground truth answer to the problem the Multi-Agent system is trying to solve (With Ground Truth) and one where it does not (Without Ground Truth). The primary model used was GPT-4o, though other models were also tested.
The systematic evaluation of these methods on the Who&When dataset yielded several important insights:

A Long Way to Go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and a mere 14.2% in pinpointing the exact error step.

Different Methods, Different Strengths: The strategies excel at different aspects of the problem, with All-at-Once comparatively better at naming the responsible agent and Step-by-Step better at locating the exact error step, while the Binary Search method provides a middle-ground performance.

Hybrid Approaches Show Promise but at a High Cost: The researchers found that combining different methods, such as using the All-at-Once approach to identify a potential agent and then applying the Step-by-Step method to find the error, can improve overall performance (a sketch of such a two-stage pipeline appears at the end of this article). However, this comes with a significant increase in computational cost.

State-of-the-Art Models Struggle: Surprisingly, even the most advanced reasoning models, like OpenAI o1 and DeepSeek R1, find this task challenging.
This highlights the inherent difficulty of automated failure attribution, which demands a higher level of reasoning than what is required for more conventional tasks.

The Importance of Explicit Reasoning: Providing explicit prompts that require the LLM to explain its reasoning in the All-at-Once and Step-by-Step methods was shown to improve performance.
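As a rough illustration of what such explicit-reasoning prompting can look like (the exact wording used in the paper may well differ), compare two hypothetical endings for an All-at-Once prompt:

```python
# Two hypothetical All-at-Once prompt endings; only the contrast matters here.
# The paper's actual prompt wording may differ.
DIRECT_SUFFIX = (
    "Answer with the responsible agent and the decisive error step, "
    "e.g. 'Planner, step 3'."
)
REASONED_SUFFIX = (
    "First explain, step by step, why the task failed and which actions were wrong. "
    "Then, on the last line, answer with the responsible agent and the decisive error step, "
    "e.g. 'Planner, step 3'."
)
```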
Context Length is a Limiting Factor: The study also revealed that as the context length of the failure logs increases, the performance of all attribution methods tends to decrease, with a more pronounced impact on the accuracy of identifying the error step.
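Finally, as a sketch of the hybrid idea mentioned in the findings above, the snippet below first runs an All-at-Once style pass to nominate a suspect agent and then applies a Step-by-Step style check restricted to that agent's steps. It reuses the hypothetical call_llm convention and (agent, message) log format from the earlier sketch; it is an assumption-laden illustration, not the authors' pipeline.

```python
from typing import Callable, List, Optional, Tuple

Log = List[Tuple[str, str]]  # (agent_name, message) steps, as in the earlier sketch
LLM = Callable[[str], str]

def hybrid_attribution(call_llm: LLM, query: str, log: Log) -> Optional[Tuple[str, int]]:
    """Hypothetical two-stage pipeline: All-at-Once picks the agent, Step-by-Step picks the step."""
    transcript = "\n".join(f"[{i}] {a}: {m}" for i, (a, m) in enumerate(log))

    # Stage 1: one holistic pass over the full log to nominate the responsible agent.
    suspect = call_llm(
        f"User query: {query}\n\nFailure log:\n{transcript}\n\n"
        "Name the single agent most responsible for the failure. Reply with the agent name only."
    ).strip()

    # Stage 2: sequential review, restricted to the suspect's steps. This is cheaper than a
    # full Step-by-Step pass, but the extra stage still raises the total number of LLM calls.
    for i, (agent, _msg) in enumerate(log):
        if agent != suspect:
            continue
        context = "\n".join(f"[{j}] {a}: {m}" for j, (a, m) in enumerate(log[: i + 1]))
        verdict = call_llm(
            f"User query: {query}\n\nInteraction so far:\n{context}\n\n"
            f"Is step [{i}] by {agent} the decisive error? Answer 'yes' or 'no', then explain."
        )
        if verdict.strip().lower().startswith("yes"):
            return suspect, i
    return None  # fall through if no step of the suspect agent is flagged
```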