ArXiv TLDR

RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

arXiv: 2604.25862

Leon Kogler, Stefan Hangler, Maximilian Ehrhart, Benedikt Dornauer, Roland Wuersching, et al.

cs.SE · cs.AI

TLDR

RESTestBench is a new benchmark and metric for evaluating LLM-generated REST API tests from natural language requirements.

Key contributions

  • Introduces RESTestBench, a benchmark of three REST services paired with manually verified NL requirements in both precise and vague variants.
  • Proposes a novel requirements-based mutation-testing metric for functional fault detection (a rough sketch follows this list).
  • Evaluates LLM-generated tests (non-refinement vs. SUT-guided refinement) across state-of-the-art LLMs.
  • Shows test effectiveness drops significantly when LLMs interact with faulty code, especially for vague requirements.
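
To make the metric concrete, here is a minimal, hypothetical sketch of how a requirements-based mutation score could be computed: a generated test is scored by the fraction of requirement-targeted mutants it kills. The service, mutant, and test below are invented for illustration; the paper's actual metric and harness may differ.

```python
# Hypothetical illustration of a requirements-based mutation score.
# All names and the toy service are assumptions, not the paper's code.
from typing import Callable, Iterable

def mutation_score(test: Callable[[Callable[[int], int]], None],
                   original: Callable[[int], int],
                   mutants: Iterable[Callable[[int], int]]) -> float:
    """Fraction of requirement-targeted mutants killed by the test.

    A mutant is 'killed' if the test fails (raises) against it.
    """
    test(original)  # sanity check: the test must pass on the correct SUT
    mutants = list(mutants)
    killed = 0
    for mutant in mutants:
        try:
            test(mutant)       # test still passes -> mutant survives
        except AssertionError:
            killed += 1        # test fails -> mutant killed
    return killed / len(mutants) if mutants else 0.0

# Requirement (precise variant): "Orders of 100 or more get a 10% discount."
def discount(total: int) -> int:
    return total - total // 10 if total >= 100 else total

def mutant_off_by_one(total: int) -> int:   # boundary mutated: > instead of >=
    return total - total // 10 if total > 100 else total

def generated_test(sut: Callable[[int], int]) -> None:
    assert sut(100) == 90   # exercises the boundary the requirement specifies
    assert sut(99) == 99

print(mutation_score(generated_test, discount, [mutant_off_by_one]))  # 1.0
```

A test that never exercised the boundary (say, only checking totals of 50 and 200) would let the off-by-one mutant survive and score lower, which is the point of tying the metric to the requirement rather than to coverage.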

Why it matters

This paper fills a critical gap in evaluating LLM-generated REST API tests by moving beyond traditional code coverage: it provides a standardized benchmark and a requirements-based metric for assessing functional API testing. Its findings also offer concrete insight into the limits of SUT interaction during test refinement, guiding future research.

Original Abstract

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
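
To make the two evaluated approaches concrete, below is a minimal sketch of the SUT-guided refinement loop as we read it from the abstract. `llm_generate`, `llm_refine`, `BASE_URL`, and the probed endpoint are placeholders we invented, not the paper's interface; approach (i) simply stops after the initial generation.

```python
# Hypothetical sketch of refinement-based test generation guided by the
# running SUT. The LLM calls and endpoint are stand-ins, not the paper's API.
import requests

BASE_URL = "http://localhost:8080"  # placeholder address for the running SUT

def llm_generate(requirement: str) -> str:
    """Placeholder: ask an LLM for a test script from the NL requirement."""
    raise NotImplementedError

def llm_refine(requirement: str, test_code: str, observation: str) -> str:
    """Placeholder: ask the LLM to revise the test given observed behaviour."""
    raise NotImplementedError

def refine_test(requirement: str, rounds: int = 3) -> str:
    test_code = llm_generate(requirement)   # (i) non-refinement stops here
    for _ in range(rounds):                 # (ii) refinement probes the SUT
        resp = requests.get(f"{BASE_URL}/items", timeout=5)
        observation = f"status={resp.status_code} body={resp.text[:200]}"
        # Key caveat from the paper: if the SUT is faulty or mutated, this
        # observation can pull the test toward the wrong behaviour,
        # eroding effectiveness, especially under vague requirements.
        test_code = llm_refine(requirement, test_code, observation)
    return test_code
```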
