Multi-Agent Systems for Root Cause Analysis in Microservices

2605.03505

Alexander Naakka, Yuqing Wang, Mika V Mäntylä

cs.SE

TLDR

LATS-RCA is a multi-agent LLM framework that uses reflection-guided tree search to automate root cause analysis in complex microservice systems.

Key contributions

  • Introduces LATS-RCA, an LLM-based multi-agent framework for microservice root cause analysis.
  • Employs a reflection-guided tree-structured search, with agents reasoning over logs and metrics.
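  • Achieves high diagnostic accuracy on the open-source Light-OAuth2 (LO2) system using a publicly available dataset, and benchmarks the associated computational cost.
  • Deployed in a case company's production environment (Prod), where accuracy is lower and cost is higher than on LO2, exposing challenges from polyglot stacks, varied logging practices, and multi-factor root causes.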

Why it matters

Prior LLM-based approaches to root cause analysis typically follow a single, linear diagnostic path. LATS-RCA instead explores several candidate paths with a reflection-guided tree search, scoring intermediate diagnostic states to steer toward the most likely root cause. It reaches high accuracy on Light-OAuth2, and a production deployment shows the approach is usable in practice while accuracy drops and cost rises as operational complexity grows.

Original Abstract

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet, prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states are used to guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS, Light-OAuth2 (LO2), using a publicly available dataset and in a production microservice environment (Prod) in a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. Compared to LO2, Prod attains lower diagnostic accuracy and incurs higher computational cost. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by polyglot tech stack, varied logging practices of source components, and multi-factor root-causes by production-scale MSS.
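
The reflection-guided tree search described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the selection rule, so the code assumes a standard UCT-style rule as used in generic Monte Carlo tree search, and it replaces the LLM agents with two hypothetical lookup tables (TOY_TREE for proposing follow-up hypotheses, TOY_SCORES for reflection scores). All names, services, and scores below are invented for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    hypothesis: str                      # a diagnostic state, e.g. a suspected service
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                   # running mean of reflection scores


def uct(node: Node, c: float = 1.0) -> float:
    """Assumed selection rule: mean score plus an exploration bonus."""
    if node.visits == 0:
        return math.inf                  # always try unvisited hypotheses first
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def diagnose(root_label: str,
             expand: Callable[[str], List[str]],
             reflect: Callable[[str], float],
             iterations: int = 20) -> List[str]:
    root = Node(root_label)
    for _ in range(iterations):
        # Selection: walk down to a leaf, following the best UCT value.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: grow the root, or any leaf that was already evaluated once.
        if node is root or node.visits > 0:
            node.children = [Node(h, parent=node) for h in expand(node.hypothesis)]
            if node.children:
                node = node.children[0]
        # Reflection: score the diagnostic state (an LLM call in a real system).
        score = reflect(node.hypothesis)
        # Backpropagation: update running means up to the root.
        while node is not None:
            node.visits += 1
            node.value += (score - node.value) / node.visits
            node = node.parent
    # Read out the answer by following the highest mean score at each level.
    path, node = [root.hypothesis], root
    while node.children:
        node = max(node.children, key=lambda n: n.value)
        path.append(node.hypothesis)
    return path


# Hypothetical stand-ins for the LLM agents (not from the paper).
TOY_TREE = {
    "incident": ["auth-service", "token-service"],
    "auth-service": ["auth: db timeout", "auth: bad config"],
    "token-service": ["token: memory leak"],
}
TOY_SCORES = {
    "auth-service": 0.7, "token-service": 0.3,
    "auth: db timeout": 0.9, "auth: bad config": 0.2,
    "token: memory leak": 0.1,
}


def toy_expand(hypothesis: str) -> List[str]:
    return TOY_TREE.get(hypothesis, [])


def toy_reflect(hypothesis: str) -> float:
    return TOY_SCORES.get(hypothesis, 0.0)


if __name__ == "__main__":
    print(" -> ".join(diagnose("incident", toy_expand, toy_reflect)))
```

Unlike a linear chain of reasoning, the search keeps weaker branches alive (here token-service) and revisits them occasionally, so an early misleading score does not lock the diagnosis onto one path. In a real system, expand and reflect would be backed by agents reading each microservice's logs and metrics.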
