ArXiv TLDR

LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics

arXiv: 2604.12218

Disha Patel

cs.LG cs.SE

TLDR

This paper benchmarks LLM-based and traditional methods for log anomaly detection, finding that fine-tuned transformers achieve the highest accuracy while zero-shot LLMs perform strongly without any labeled data.

Key contributions

  • Comprehensive benchmark of LLM-based and traditional log anomaly detection methods across four datasets.
  • Evaluated classical parsers + ML, fine-tuned transformers, and prompt-based LLMs in zero-shot/few-shot settings.
  • Fine-tuned transformers achieved the highest F1 (0.96-0.99); zero-shot LLMs reached a strong F1 (0.82-0.91) without labels (see the prompting sketch after this list).
  • Analyzed cost-accuracy, latency, and failure modes, offering actionable guidelines for practitioners.
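
The zero-shot setting requires no training data at all: a window of raw log lines is embedded in a classification prompt and the model's one-word verdict is parsed. Below is a minimal sketch of that idea, assuming the OpenAI Python SDK; the prompt wording, the gpt-4 model choice, and the HDFS-style sample lines are illustrative stand-ins, not the paper's actual setup.

```python
# Hypothetical zero-shot classifier for log windows. Prompt text, model
# name, and sample logs are assumptions for illustration; the paper's
# exact prompts are not reproduced in this summary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a site reliability engineer. Classify the following log "
    "sequence as NORMAL or ANOMALY. Answer with a single word.\n\n"
    "Log sequence:\n{logs}"
)

def classify_window(log_lines: list[str], model: str = "gpt-4") -> str:
    """Return 'ANOMALY' or 'NORMAL' for a window of raw log lines."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers for classification
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(logs="\n".join(log_lines)),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "ANOMALY" if "ANOMALY" in answer else "NORMAL"

if __name__ == "__main__":
    window = [
        "INFO dfs.DataNode$PacketResponder: Received block blk_-1608999687919862906",
        "ERROR dfs.DataNode$DataXceiver: writeBlock received exception",
    ]
    print(classify_window(window))
```

A few-shot variant, which the paper also evaluates, would simply prepend a handful of labeled example sequences to the same prompt.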

Why it matters

This paper offers the first systematic benchmark comparing LLM-based and traditional log anomaly detection, filling a critical gap. It gives practitioners clear guidance for selecting a method based on accuracy, cost, and data availability, which is crucial for keeping large-scale systems reliable.

Original Abstract

System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkable zero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data -- a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.
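
For contrast, here is a minimal sketch of the classical route the abstract describes: parse raw lines into stable templates with a Drain-style parser, then train a supervised classifier on those templates. The open-source drain3 and scikit-learn packages stand in for the paper's tooling; the exact parser configuration, features, classifier, and toy data below are all assumptions.

```python
# Classical pipeline sketch: Drain template mining + TF-IDF + logistic
# regression. The toy log lines and labels are invented for illustration.
from drain3 import TemplateMiner
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

miner = TemplateMiner()  # default Drain configuration

def to_template(line: str) -> str:
    """Map a raw log line to its mined template (variable tokens masked)."""
    return miner.add_log_message(line)["template_mined"]

# Toy labeled data: one log line per sample, label 1 = anomalous.
raw_logs = [
    "Received block blk_123 of size 67108864 from /10.0.0.1",
    "Received block blk_456 of size 67108864 from /10.0.0.2",
    "PacketResponder failed for block blk_789",
    "PacketResponder failed for block blk_012",
]
labels = [0, 0, 1, 1]

templates = [to_template(line) for line in raw_logs]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(templates, labels)

# An unseen line maps to a known template, so the classifier generalizes.
print(clf.predict([to_template("PacketResponder failed for block blk_999")]))
```

A fine-tuned transformer, the best-performing category in the benchmark, would replace the TF-IDF features with BERT or RoBERTa representations learned end-to-end on labeled log windows.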
