TORAI: Unsupervised Fine-grained RCA using Multi-Source Telemetry Data
Luan Pham, Huong Ha, Xiuzhen Zhang, Hongyu Zhang
TLDR
TORAI is an unsupervised method for fine-grained root cause analysis in microservice systems, effectively handling "blind spots" without relying on service call graphs.
Key contributions
- Addresses "blind spots" in microservice RCA where services lack traces or call graph data.
- Unsupervised approach that leverages multi-source telemetry data, eliminating the need for a service call graph.
- Employs severity measurement, clustering, causal analysis, and hypothesis testing for fine-grained root cause identification.
- Outperforms state-of-the-art baselines on three benchmark systems when blind spots are present, and pinpoints root causes of real-world failures within its top-3 recommendations.
Why it matters
Existing multi-source RCA methods diagnose only the services visible on the call graph, so they can miss faults in evolving microservice systems that contain "blind spots" (services without trace data). TORAI works from whatever multi-source telemetry is available and performs fine-grained root cause analysis without a service call graph or further intrusive actions. This makes it practical for real-world, dynamic microservice environments.
Original Abstract
Existing multi-source root cause analysis (RCA) methods for microservice systems assume all services have traces to construct a service call graph. However, this assumption is not practical as microservice systems evolve rapidly and may contain blackbox services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. In the presence of blind spots, the performance of existing multi-source RCA methods may be affected, as they only diagnose visible services on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that effectively pinpoints fine-grained root causes without relying on the service call graph. Instead, TORAI first measures anomaly severity using available multi-source telemetry data. It then performs clustering to group services based on their severity symptoms and conducts causal analysis to rank services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. TORAI provides an unsupervised approach that leverages available multi-source telemetry data for RCA without requiring a constructed service call graph or further intrusive actions, thus addressing the limitations of existing methods. Our experiments on three benchmark systems demonstrate that TORAI outperforms state-of-the-art baselines remarkably in the presence of blind spots. Performance on real-world failures further shows that TORAI can accurately pinpoint the root causes in top-3 recommendations.
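The abstract's pipeline (severity measurement, severity-based clustering, ranking within clusters, aggregation, then a fine-grained check) can be illustrated with a toy sketch. This is not the authors' implementation; every name and threshold below is hypothetical. It assumes metrics only (no logs or traces), uses a z-score against a pre-fault baseline as the severity measure, groups services by a simple gap rule, substitutes "earliest anomaly onset" for the paper's causal analysis, and substitutes a three-sigma cut-off for its hypothesis test.

```python
from statistics import mean, stdev

def summarize(normal, abnormal, z_threshold):
    """Return (max z-score, onset index) of the abnormal window vs. the baseline."""
    mu = mean(normal)
    sigma = stdev(normal) or 1e-9  # guard against a constant baseline
    zs = [abs(x - mu) / sigma for x in abnormal]
    # Onset: first abnormal sample above the threshold (len(zs) if none).
    onset = next((i for i, z in enumerate(zs) if z > z_threshold), len(zs))
    return max(zs), onset

def diagnose(telemetry, z_threshold=3.0, gap=5.0):
    """telemetry: {service: {metric: (normal_samples, abnormal_samples)}}.

    Returns [(service, [suspect metrics])], most likely root cause first.
    """
    # 1. Severity per metric, with no call graph involved.
    stats = {
        svc: {m: summarize(n, a, z_threshold) for m, (n, a) in metrics.items()}
        for svc, metrics in telemetry.items()
    }
    # Service-level view: worst severity and earliest onset over its metrics.
    services = [
        (svc, max(s for s, _ in ms.values()), min(o for _, o in ms.values()))
        for svc, ms in stats.items()
    ]
    services.sort(key=lambda t: t[1], reverse=True)

    # 2. Cluster: a service joins the current cluster if its severity is
    #    within `gap` of the previous one (hypothetical grouping rule).
    clusters = []
    for entry in services:
        if clusters and clusters[-1][-1][1] - entry[1] < gap:
            clusters[-1].append(entry)
        else:
            clusters.append([entry])

    # 3. Rank within each cluster by earliest onset (a crude stand-in for
    #    causal analysis), then 4. concatenate clusters, most severe first.
    ranking = []
    for cluster in clusters:
        ranking.extend(svc for svc, _, _ in sorted(cluster, key=lambda t: t[2]))

    # 5. Fine-grained step: keep metrics above the cut-off (a stand-in for
    #    a formal hypothesis test), strongest first.
    return [
        (svc, sorted((m for m, (s, _) in stats[svc].items() if s > z_threshold),
                     key=lambda m: -stats[svc][m][0]))
        for svc in ranking
    ]

# Toy data: "db" degrades first, "cart" follows, "frontend" stays healthy.
telemetry = {
    "cart": {"cpu": ([10, 11, 9, 10, 10], [10, 50, 52, 51])},
    "db": {"cpu": ([20, 21, 19, 20, 20], [60, 61, 60, 59])},
    "frontend": {"latency": ([50, 51, 49, 50, 50], [50, 50, 51, 50])},
}
result = diagnose(telemetry)
print(result)
```

On this toy input, `cart` and `db` fall into the same severity cluster, and `db` is ranked first because its anomaly starts earlier, even though `cart` has the slightly higher peak severity. That is the role the within-cluster ranking plays in the sketch; the paper's actual causal analysis and hypothesis test are more involved than these stand-ins.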