ArXiv TLDR

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

arXiv:2604.21889

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di + 1 more

cs.CL, cs.AI, cs.LG

TLDR

TingIS is an enterprise-scale system that combines LLM-based event linking, cascaded routing, and multi-dimensional noise reduction to surface risk events from noisy customer incidents in real time.

Key contributions

  • Multi-stage event linking engine uses LLMs and indexing for accurate incident merging.
  • Cascaded routing mechanism ensures precise business attribution of incidents.
  • Multi-dimensional noise reduction pipeline integrates domain knowledge and statistical patterns.
  • Achieves a 95% discovery rate for high-priority incidents with a P90 alert latency of 3.5 minutes in production.
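
The first contribution pairs a cheap index with a more expensive merge decision. The sketch below illustrates that general two-stage pattern only; it is not the paper's implementation. `EventLinker`, `jaccard_judge`, the token-overlap index, and both thresholds are hypothetical names and values chosen for illustration, and the judge is a toy similarity check standing in for an LLM call.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization; a stand-in for real text features."""
    return set(text.lower().split())


def jaccard_judge(a: str, b: str) -> bool:
    """Toy merge judge (assumption): token Jaccard similarity instead of an LLM."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return bool(union) and len(ta & tb) / len(union) >= 0.4


class EventLinker:
    """Hypothetical two-stage linker: index lookup first, then a merge judge."""

    def __init__(self, judge: Callable[[str, str], bool], min_overlap: int = 2):
        self.judge = judge              # placeholder for an LLM merge decision
        self.min_overlap = min_overlap  # shared tokens needed to become a candidate
        self.events: List[List[str]] = []                   # event id -> incident texts
        self.index: Dict[str, Set[int]] = defaultdict(set)  # token -> event ids

    def add(self, incident: str) -> int:
        """Attach the incident to an existing event or open a new one."""
        tokens = tokenize(incident)
        # Stage 1: cheap candidate retrieval from the inverted index.
        overlap: Dict[int, int] = defaultdict(int)
        for tok in tokens:
            for eid in self.index[tok]:
                overlap[eid] += 1
        candidates = sorted(
            (eid for eid, n in overlap.items() if n >= self.min_overlap),
            key=lambda eid: -overlap[eid],
        )
        # Stage 2: consult the expensive judge only for the shortlisted events.
        for eid in candidates:
            if self.judge(self.events[eid][0], incident):
                self.events[eid].append(incident)
                break
        else:
            # No candidate accepted the incident, so it starts a new event.
            eid = len(self.events)
            self.events.append([incident])
        for tok in tokens:
            self.index[tok].add(eid)
        return eid


linker = EventLinker(jaccard_judge)
first = linker.add("payment page fails to load")
second = linker.add("payment page fails to load on mobile")
third = linker.add("cannot reset my password")
```

In this toy run the two payment reports end up in one event and the password report opens a second one. The point of the split is that the judge is called only for events the index has already shortlisted, which keeps the number of expensive calls per message small.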

Why it matters

TingIS targets risks that monitoring misses by mining customer incident reports in real time. By combining LLMs with efficient indexing and noise reduction, it outperforms baseline methods in routing accuracy, clustering quality, and signal-to-noise ratio on benchmarks built from real-world data. That matters in large-scale cloud-native services, where even minutes of downtime can cause major financial losses and erode user trust.

Original Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
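
To put the deployment figures in context, the short sketch below works through the arithmetic. It assumes the 300,000 messages per day can be read as a typical daily volume (the abstract does not say), and it uses the nearest-rank definition of a percentile, which may differ from how the authors compute P90. The sample latencies are invented purely to show the calculation.

```python
import math
from typing import Sequence

# Figures quoted in the abstract.
PEAK_PER_MINUTE = 2_000
MESSAGES_PER_DAY = 300_000

# Average rate if the daily volume were spread evenly over 24 hours.
avg_per_minute = MESSAGES_PER_DAY / (24 * 60)
# How far the quoted peak sits above that average.
peak_to_avg = PEAK_PER_MINUTE / avg_per_minute


def percentile_nearest_rank(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical alert latencies in minutes, for illustration only.
sample_latencies = [1.2, 0.8, 2.5, 3.1, 1.9, 2.2, 4.8, 1.1, 3.4, 2.0]
p90 = percentile_nearest_rank(sample_latencies, 90)
```

Under that reading, the average works out to roughly 208 messages per minute, so the quoted peak is at least about 9.6 times the average rate. A P90 latency of 3.5 minutes means that nine out of ten alerts fire within 3.5 minutes, while the slowest tenth may take longer.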
