POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
TLDR
This paper introduces POSTCONDBENCH, a new benchmark for evaluating LLM-generated program postconditions for both correctness and completeness.
Key contributions
- Introduces POSTCONDBENCH, a multilingual benchmark for method-level postcondition generation.
- Comprises 420 Python and Java tasks drawn from 121 real-world open-source projects, each paired with a ground-truth postcondition set constructed with expert involvement.
- Enables automatic evaluation via a runnable execution environment, and operationalizes completeness as defect discrimination (see the sketch after this list).
- Evaluates five state-of-the-art LLMs under three generation settings, revealing a substantial gap between correctness and completeness that repository-level dependencies and method complexity exacerbate.
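The defect-discrimination measure is straightforward to sketch. The snippet below is a minimal illustration of the idea, not the benchmark's actual harness: it assumes each task supplies a reference implementation, a pool of defective variants, and a set of test inputs, and it treats a postcondition as a predicate over (arguments, result). All names here are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical shapes, for illustration only: a postcondition is a
# predicate over (args, result); an implementation maps *args -> result.
Postcondition = Callable[[tuple, object], bool]
Implementation = Callable[..., object]

def satisfied_on(impl: Implementation,
                 postconds: Sequence[Postcondition],
                 test_inputs: Sequence[tuple]) -> bool:
    """True iff every postcondition holds on every execution of impl."""
    for args in test_inputs:
        result = impl(*args)
        if not all(p(args, result) for p in postconds):
            return False
    return True

def defect_discrimination(postconds: Sequence[Postcondition],
                          reference: Implementation,
                          defects: Sequence[Implementation],
                          test_inputs: Sequence[tuple]) -> Optional[float]:
    """Correctness gate plus completeness score: the postcondition set is
    scored only if the reference implementation satisfies it; completeness
    is then the fraction of defective variants that violate it."""
    if not satisfied_on(reference, postconds, test_inputs):
        return None  # incorrect postcondition set: not scored
    killed = sum(1 for d in defects
                 if not satisfied_on(d, postconds, test_inputs))
    return killed / len(defects)
```

Under this reading, a vacuous set (e.g. a single always-true predicate) is trivially correct but scores 0.0, while a set that pins the result uniquely rejects every behaviorally distinct defect.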
Why it matters
Existing benchmarks for postcondition generation emphasize correctness under limited settings and lack a robust notion of completeness. POSTCONDBENCH fills this gap with real-world tasks, a runnable evaluation environment, and automated defect discrimination, enabling more reliable assessment and development of LLM tools for program verification.
Original Abstract
Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.
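To make the correctness-completeness gap concrete, here is a toy example of our own (not drawn from the benchmark). For an absolute-value method, the postcondition `result >= 0` is correct, since the intended implementation never violates it, yet incomplete, because defective implementations can satisfy it too:

```python
def my_abs(x: int) -> int:
    return x if x >= 0 else -x

def buggy_abs(x: int) -> int:
    return 0  # defective: always returns zero

inputs = (-3, 0, 5)

# Weak postcondition: correct, but the defect slips through.
assert all(my_abs(x) >= 0 for x in inputs)
assert all(buggy_abs(x) >= 0 for x in inputs)  # defect not discriminated

def complete_post(x: int, r: int) -> bool:
    # Stronger postcondition: non-negative AND equal to x or -x.
    return r >= 0 and (r == x or r == -x)

assert all(complete_post(x, my_abs(x)) for x in inputs)
assert not all(complete_post(x, buggy_abs(x)) for x in inputs)  # defect caught
```

The stronger set remains satisfied on correct executions but is violated by the defective variant, which is exactly the behavior the benchmark's completeness measure rewards.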