E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

arXiv:2604.11094

Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Minghua He, Chiming Duan + 3 more

cs.SE, cs.AI

TLDR

E2E-REME introduces an end-to-end model for autonomous microservice remediation, generating executable playbooks directly from diagnosis reports.

Key contributions

  • Introduces E2E-MR, a new task for end-to-end microservice auto-remediation.
  • Develops MicroRemed, a benchmark for rigorous evaluation of remediation systems.
  • Proposes E2E-REME, a model trained with experience-simulation reinforcement fine-tuning.
  • Reports higher accuracy and efficiency than nine representative LLMs on public and industrial microservice platforms.

Why it matters

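This paper addresses the growing challenge of microservice failures by targeting autonomous remediation. Instead of relying on expert-crafted prompts to large general-purpose LLMs, it trains an end-to-end model that generates executable playbooks directly from diagnosis reports, which the authors report is more accurate and efficient. If those results hold in practice, the approach could reduce manual intervention and system downtime.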

Original Abstract

Contemporary microservice systems continue to grow in scale and complexity, leading to increasingly frequent and costly failures. While recent LLM-based auto-remediation approaches have emerged, they primarily translate textual instructions into executable Ansible playbooks and rely on expert-crafted prompts, lacking runtime knowledge guidance and depending on large-scale general-purpose LLMs, which limits their accuracy and efficiency. We introduce End-to-End Microservice Remediation (E2E-MR), a new task that requires directly generating executable playbooks from diagnosis reports to autonomously restore faulty systems. To enable rigorous evaluation, we build MicroRemed, a benchmark that automates microservice deployment, failure injection, playbook execution, and post-repair verification. We further propose E2E-REME, an end-to-end auto-remediation model trained via experience-simulation reinforcement fine-tuning. Experiments on public and industrial microservice platforms, compared with nine representative LLMs, show that E2E-REME achieves superior accuracy and efficiency.
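
The abstract names four automated stages in the benchmark: deployment, failure injection, playbook execution, and post-repair verification. As a rough illustration of how such a loop could be wired together, here is a minimal Python sketch. It is not the MicroRemed implementation: the names `Case`, `Stages`, `evaluate`, and `generate_playbook` are hypothetical, and using the fraction of repaired cases as the score is an assumption, since the abstract does not define its accuracy metric.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Case:
    """One benchmark case (hypothetical shape): a fault and its diagnosis report."""
    fault: str
    diagnosis_report: str


@dataclass
class Stages:
    """Hypothetical hooks for the four automated stages named in the abstract."""
    deploy: Callable[[], None]                # bring up a fresh microservice system
    inject_failure: Callable[[str], None]     # inject the named fault
    execute_playbook: Callable[[str], bool]   # True if the playbook ran without error
    verify_repair: Callable[[], bool]         # True if the system is healthy again


def evaluate(
    cases: Iterable[Case],
    generate_playbook: Callable[[str], str],
    stages: Stages,
) -> float:
    """Return the fraction of cases repaired (an assumed metric, not the paper's)."""
    cases = list(cases)
    repaired = 0
    for case in cases:
        stages.deploy()                       # start each case from a clean system
        stages.inject_failure(case.fault)
        # End-to-end step: the diagnosis report goes in, a playbook comes out.
        playbook = generate_playbook(case.diagnosis_report)
        if stages.execute_playbook(playbook) and stages.verify_repair():
            repaired += 1
    return repaired / len(cases) if cases else 0.0
```

In this sketch the model under test is just the `generate_playbook` callable, so any of the compared LLMs could be slotted in the same way. Re-deploying before every case keeps one failed repair from contaminating the next.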
