ArXiv TLDR

Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?

2604.26670

Runzhou Wang, Shenglin Zhang, Wenwei Gu, Yongxin Zhao, Chenyu Zhao + 3 more

cs.SE

TLDR

NexusRCL improves microservice root cause localization by modeling heterogeneous fault propagation between services and hosts.

Key contributions

  • Analyzes how entity-level heterogeneity (services, hosts) drives asymmetric fault propagation in microservices.
  • Identifies cross-layer interactions between services and hosts as key drivers of fault behavior.
  • Introduces NexusRCL, a semi-supervised framework using heterogeneous graphs for root cause localization.
  • NexusRCL improves Top-1 accuracy by up to 49.85% over state-of-the-art methods on two industrial microservice benchmark datasets.
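
To make the idea of a heterogeneous graph concrete, here is a minimal sketch of one in which services and hosts are distinct node types and every edge carries a typed, directed relation. It is an illustration only, not NexusRCL's implementation: the class names (`Node`, `HeteroGraph`), the relation names (`calls`, `runs_on`, `hosts`), and the example entities are all assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A system entity; `kind` is the node type ("service" or "host")."""
    name: str
    kind: str


class HeteroGraph:
    """Directed graph whose edges are grouped by (src_kind, relation, dst_kind)."""

    def __init__(self):
        # Hypothetical storage layout: one edge list per typed relation.
        self.edges = defaultdict(list)

    def add_edge(self, src, relation, dst):
        # Edges are directed, so A -> B does not imply B -> A.
        self.edges[(src.kind, relation, dst.kind)].append((src, dst))

    def successors(self, node, relation):
        """Return nodes reachable from `node` over one edge of `relation`."""
        return [
            dst
            for (_, rel, _), pairs in self.edges.items()
            if rel == relation
            for src, dst in pairs
            if src == node
        ]


# Illustrative entities (not taken from the paper's datasets).
checkout = Node("checkout", "service")
payment = Node("payment", "service")
host_a = Node("host-a", "host")

graph = HeteroGraph()
graph.add_edge(checkout, "calls", payment)    # service -> service
graph.add_edge(checkout, "runs_on", host_a)   # service -> host (cross-layer)
graph.add_edge(payment, "runs_on", host_a)    # service -> host (cross-layer)
graph.add_edge(host_a, "hosts", checkout)     # host -> service (cross-layer)
graph.add_edge(host_a, "hosts", payment)      # host -> service (cross-layer)

print(sorted(graph.edges))
print(graph.successors(host_a, "hosts"))
```

Keeping each relation as its own directed edge type is what lets a model treat host-to-service propagation differently from service-to-service propagation, which is the asymmetry the contributions above describe.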

Why it matters

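Root cause localization in microservice systems is hard because the systems are heterogeneous: they emit diverse observability data and span multiple entity types. This paper shows that the distinction between services and hosts drives asymmetric fault propagation, and proposes NexusRCL to exploit those propagation patterns for more accurate diagnosis in cloud-native systems.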

Original Abstract

Microservice root cause localization is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to effectively capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.
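
The abstract reports Top-1 and Average Top-5 accuracy without defining them. The sketch below uses the definitions that are common in root cause localization work, which is an assumption here since the abstract does not state the formulas: Top-k accuracy is the fraction of fault cases whose true root cause appears among the first k ranked candidates, and Average Top-k is the mean of Top-1 through Top-k. The function names and sample data are hypothetical.

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of cases whose true root cause is within the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def avg_top_k(rankings, truths, k):
    """Mean of Top-1 .. Top-k accuracy (assumed definition of the average metric)."""
    return sum(top_k_accuracy(rankings, truths, i) for i in range(1, k + 1)) / k


# Illustrative data: three fault cases, each with a ranked candidate list.
rankings = [
    ["svc-a", "svc-b", "host-1"],  # true cause ranked 1st
    ["svc-b", "svc-a", "host-1"],  # true cause ranked 2nd
    ["host-1", "svc-b", "svc-a"],  # true cause ranked 3rd
]
truths = ["svc-a", "svc-a", "svc-a"]

a1 = top_k_accuracy(rankings, truths, 1)
avg3 = avg_top_k(rankings, truths, 3)
print(f"Top-1 accuracy: {a1:.3f}, Average Top-3 accuracy: {avg3:.3f}")
```

With this sample, one of three cases is solved at rank 1, two at rank 2, and all three at rank 3, so the averaged metric rewards methods that place the true cause near the top even when they miss rank 1.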