ArXiv TLDR

Assessing REST API Test Generation Strategies with Log Coverage

2604.07073

Nana Reinikainen, Mika Mäntylä, Yuqing Wang

cs.SE

TLDR

This paper introduces log coverage metrics for assessing REST API test generation strategies, finding that tests generated by Claude Opus 4.6 outperform human-written tests and that the two combine well.

Key contributions

  • Proposes three log coverage metrics (average, min, max) for black-box REST API test assessment.
  • Empirically evaluates EvoMaster, LLMs (Claude Opus 4.6, GPT-5.2-Codex), and human-written tests on the Light-OAuth2 authorization microservice.
  • Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests.
  • Combining human-written tests with Claude Opus 4.6 significantly increases total log coverage.
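The three metrics are not formally defined in this summary, but under the plausible reading that each test run yields a set of unique log templates, they can be sketched as follows (function and data names are hypothetical):

```python
def log_coverage_metrics(runs):
    """Compute average, min, and max log coverage over multiple runs.

    `runs` is a list of sets, each holding the unique log templates
    observed during one execution of a test suite.
    """
    counts = [len(templates) for templates in runs]
    return {
        "avg": sum(counts) / len(counts),
        "min": min(counts),
        "max": max(counts),
    }

# Example: three runs of the same generated suite
runs = [
    {"user created", "token issued", "token expired"},
    {"user created", "token issued"},
    {"user created", "token issued", "token expired", "key rotated"},
]
print(log_coverage_metrics(runs))  # {'avg': 3.0, 'min': 2, 'max': 4}
```

Reporting min and max alongside the average captures the run-to-run variability that the paper attributes to nondeterministic test generation and runtime behavior.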

Why it matters

Assessing black-box REST API tests is challenging. This paper introduces novel log coverage metrics to evaluate test generation strategies, revealing that LLMs like Claude Opus 4.6 can significantly improve test effectiveness. It also highlights the value of combining diverse test generation approaches to achieve broader coverage.
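The complementarity analysis can be read as measuring how much one strategy's coverage grows when its observed log templates are merged with another's. A minimal sketch, assuming set-valued coverage as above (names are illustrative, not from the paper):

```python
def combined_gain(a, b):
    """Relative increase in a strategy's log coverage when its
    observed templates (`a`) are merged with another strategy's (`b`)."""
    combined = a | b  # union of unique log templates
    return (len(combined) - len(a)) / len(a)

# Hypothetical template sets for two strategies
human = {"login ok", "login failed", "token issued"}
llm = {"login ok", "token issued", "token revoked", "key rotated"}

# Gain for each side from combining the two suites
print(f"{combined_gain(human, llm):.1%}")  # 66.7%
print(f"{combined_gain(llm, human):.1%}")  # 25.0%
```

Large gains in both directions, as reported for every pairing in the abstract, indicate that the strategies exercise largely distinct runtime behaviors.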

Original Abstract

Assessing the effectiveness of REST API tests in black-box settings can be challenging due to the lack of access to source code coverage metrics and polyglot tech stack. We propose three metrics for capturing average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies, Evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests, on Light-OAuth2 authorization microservice system. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% and 38.9% in human-written and Claude tests respectively. When combining Locust tests with EvoMaster the same increases are 30.7% and 76.9% and when using GPT-5.2-Codex 26.1% and 105.6%. This means that the generation strategies exercise largely distinct runtime behaviors. Our future work includes extending our study to multiple systems.
